2025-05-23-12-07
Logic-of-Thought: Empowering Large Language Models with Logic Programs for Solving Puzzles in Natural Language
Abstract
arXiv:2505.16114v1 Announce Type: new Abstract: Solving puzzles in natural language poses a long-standing challenge in AI. While large language models (LLMs) have recently shown impressive capabilities in a variety of tasks, they continue to struggle with complex puzzles that demand precise reasoning and exhaustive search. In this paper, we propose Logic-of-Thought (Logot), a novel framework that bridges LLMs with logic programming to address this problem. Our method leverages LLMs to translate puzzle rules and states into answer set programs (ASPs), whose solutions are then accurately and efficiently inferred by an ASP interpreter. This hybrid approach combines the natural language understanding of LLMs with the precise reasoning capabilities of logic programs. We evaluate our method on various grid puzzles and dynamic puzzles involving actions, demonstrating near-perfect accuracy across all tasks. Our code and data are available at: https://github.com/naiqili/Logic-of-Thought.
SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution
Abstract
arXiv:2505.16048v1 Announce Type: new Abstract: We introduce a novel dataset designed to benchmark the physical and spatial reasoning capabilities of Large Language Models (LLMs) based on topology optimization, a method for computing optimal material distributions within a design space under prescribed loads and supports. In this dataset, LLMs are provided with conditions such as a 2D boundary, applied forces, and supports, and must reason about the resulting optimal material distribution. The dataset includes a variety of tasks, ranging from filling in masked regions within partial structures to predicting complete material distributions. Solving these tasks requires understanding the flow of forces and the required material distribution under given constraints, without access to simulation tools or explicit physical models, challenging models to reason about structural stability and spatial organization. Our dataset targets the evaluation of spatial and physical reasoning abilities in 2D settings, offering a complementary perspective to traditional language and logic benchmarks.
Causal LLM Routing: End-to-End Regret Minimization from Observational Data
Abstract
arXiv:2505.16037v1 Announce Type: new Abstract: LLM routing aims to select the most appropriate model for each query, balancing competing performance metrics such as accuracy and cost across a pool of language models. Prior approaches typically adopt a decoupled strategy, where the metrics are first predicted and the model is then selected based on these estimates. This setup is prone to compounding errors and often relies on full-feedback data, where each query is evaluated by all candidate models, which is costly to obtain and maintain in practice. In contrast, we learn from observational data, which records only the outcome of the model actually deployed. We propose a causal end-to-end framework that learns routing policies by minimizing decision-making regret from observational data. To enable efficient optimization, we introduce two theoretically grounded surrogate objectives: a classification-based upper bound, and a softmax-weighted regret approximation shown to recover the optimal policy at convergence. We further extend our framework to handle heterogeneous cost preferences via an interval-conditioned architecture. Experiments on public benchmarks show that our method outperforms existing baselines, achieving state-of-the-art performance across different embedding models.
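The softmax-weighted regret mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's estimator: the utilities below are hypothetical and fully known for every model, whereas the paper learns from observational data in which only the deployed model's outcome is recorded. The function names are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def regrets(utilities):
    """Regret of each model: gap between the best utility and its own."""
    best = max(utilities)
    return [best - u for u in utilities]


def softmax_weighted_regret(scores, utilities):
    """Expected regret when a model is drawn from softmax(scores)."""
    probs = softmax(scores)
    return sum(p * r for p, r in zip(probs, regrets(utilities)))


# Hypothetical utilities (e.g. accuracy minus a cost penalty) for three models.
utilities = [0.70, 0.85, 0.60]

# A router with no preference spreads probability evenly over the models.
uniform = softmax_weighted_regret([0.0, 0.0, 0.0], utilities)

# A router that strongly prefers the best model has regret close to zero.
peaked = softmax_weighted_regret([0.0, 10.0, 0.0], utilities)
```

As the router's scores concentrate on the highest-utility model, the weighted regret approaches zero, which is the intuition behind using such a quantity as a differentiable training objective.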
Optimizing LLM-Based Multi-Agent System with Textual Feedback: A Case Study on Software Development
Abstract
arXiv:2505.16086v1 Announce Type: new Abstract: We have seen remarkable progress in multi-agent systems empowered by large language models (LLMs), which solve complex tasks that require cooperation among experts with diverse skills. However, optimizing LLM-based multi-agent systems remains challenging. In this work, we perform an empirical case study on group optimization of role-based multi-agent systems utilizing natural language feedback for challenging software development tasks under various evaluation dimensions. We propose a two-step agent prompt optimization pipeline: identifying underperforming agents with their failure explanations utilizing textual feedback and then optimizing system prompts of identified agents utilizing failure explanations. We then study the impact of various optimization settings on system performance with two comparison groups: online against offline optimization and individual against group optimization. For group optimization, we study two prompting strategies: one-pass and multi-pass prompting optimizations. Overall, we demonstrate the effectiveness of our optimization method for role-based multi-agent systems tackling software development tasks evaluated on diverse evaluation dimensions, and we investigate the impact of diverse optimization settings on group behaviors of the multi-agent systems to provide practical insights for future development.
LLM-Powered AI Agent Systems and Their Applications in Industry
Abstract
arXiv:2505.16120v1 Announce Type: new Abstract: The emergence of Large Language Models (LLMs) has reshaped agent systems. Unlike traditional rule-based agents with limited task scope, LLM-powered agents offer greater flexibility, cross-domain reasoning, and natural language interaction. Moreover, with the integration of multi-modal LLMs, current agent systems are highly capable of processing diverse data modalities, including text, images, audio, and structured tabular data, enabling richer and more adaptive real-world behavior. This paper comprehensively examines the evolution of agent systems from the pre-LLM era to current LLM-powered architectures. We categorize agent systems into software-based, physical, and adaptive hybrid systems, highlighting applications across customer service, software development, manufacturing automation, personalized education, financial trading, and healthcare. We further discuss the primary challenges posed by LLM-powered agents, including high inference latency, output uncertainty, lack of evaluation metrics, and security vulnerabilities, and propose potential solutions to mitigate these concerns.
TrialPanorama: Database and Benchmark for Systematic Review and Design of Clinical Trials
Abstract
arXiv:2505.16097v1 Announce Type: new Abstract: Developing artificial intelligence (AI) for vertical domains requires a solid data foundation for both training and evaluation. In this work, we introduce TrialPanorama, a large-scale, structured database comprising 1,657,476 clinical trial records aggregated from 15 global sources. The database captures key aspects of trial design and execution, including trial setups, interventions, conditions, biomarkers, and outcomes, and links them to standard biomedical ontologies such as DrugBank and MedDRA. This structured and ontology-grounded design enables TrialPanorama to serve as a unified, extensible resource for a wide range of clinical trial tasks, including trial planning, design, and summarization. To demonstrate its utility, we derive a suite of benchmark tasks directly from the TrialPanorama database. The benchmark spans eight tasks across two categories: three for systematic review (study search, study screening, and evidence summarization) and five for trial design (arm design, eligibility criteria, endpoint selection, sample size estimation, and trial completion assessment). The experiments using five state-of-the-art large language models (LLMs) show that while general-purpose LLMs exhibit some zero-shot capability, their performance is still inadequate for high-stakes clinical trial workflows. We release TrialPanorama database and the benchmark to facilitate further research on AI for clinical trials.
Sudoku-Bench: Evaluating creative reasoning with Sudoku variants
Abstract
arXiv:2505.16135v1 Announce Type: new Abstract: Existing reasoning benchmarks for large language models (LLMs) frequently fail to capture authentic creativity, often rewarding memorization of previously observed patterns. We address this shortcoming with Sudoku-Bench, a curated benchmark of challenging and unconventional Sudoku variants specifically selected to evaluate creative, multi-step logical reasoning. Sudoku variants form an unusually effective domain for reasoning research: each puzzle introduces unique or subtly interacting constraints, making memorization infeasible and requiring solvers to identify novel logical breakthroughs ("break-ins"). Despite their diversity, Sudoku variants maintain a common and compact structure, enabling clear and consistent evaluation. Sudoku-Bench includes a carefully chosen puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles, making it easy to extend into a general research environment. Baseline experiments show that state-of-the-art LLMs solve fewer than 15% of puzzles unaided, highlighting significant opportunities to advance long-horizon, strategic reasoning capabilities.
How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior
Abstract
arXiv:2505.16067v1 Announce Type: new Abstract: Memory is a critical component in large language model (LLM)-based agents, enabling them to store and retrieve past executions to improve task performance over time. In this paper, we conduct an empirical study on how memory management choices impact LLM agents' behavior, especially their long-term performance. Specifically, we focus on two fundamental memory operations that are widely used by many agent frameworks (addition, which incorporates new experiences into the memory base, and deletion, which selectively removes past experiences) to systematically study their impact on agent behavior. Through our quantitative analysis, we find that LLM agents display an experience-following property: high similarity between a task input and the input in a retrieved memory record often results in highly similar agent outputs. Our analysis further reveals two significant challenges associated with this property: error propagation, where inaccuracies in past experiences compound and degrade future performance, and misaligned experience replay, where outdated or irrelevant experiences negatively influence current tasks. Through controlled experiments, we show that combining selective addition and deletion strategies can help mitigate these negative effects, yielding an average absolute performance gain of 10% compared to naive memory growth. Furthermore, we highlight how memory management choices affect agents' behavior under challenging conditions such as task distribution shifts and constrained memory resources. Our findings offer insights into the behavioral dynamics of LLM agent memory systems and provide practical guidance for designing memory components that support robust, long-term agent performance. We also release our code to facilitate further study.
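The two memory operations can be sketched as a small store with similarity-based retrieval. This is an illustration under stated assumptions, not the paper's implementation: the `quality` score, the admission threshold, the capacity-based eviction rule, and the token-overlap similarity are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Record:
    task: str
    output: str
    quality: float  # hypothetical evaluator score in [0, 1]


def jaccard(a: str, b: str) -> float:
    """Token-set overlap, used here as a stand-in similarity measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


class Memory:
    def __init__(self, add_threshold: float, capacity: int):
        self.add_threshold = add_threshold
        self.capacity = capacity
        self.records = []

    def add(self, record: Record) -> bool:
        """Selective addition: reject low-quality experiences outright."""
        if record.quality < self.add_threshold:
            return False
        self.records.append(record)
        # Selective deletion (assumed rule): when over capacity,
        # drop the lowest-quality record.
        if len(self.records) > self.capacity:
            worst = min(self.records, key=lambda r: r.quality)
            self.records.remove(worst)
        return True

    def retrieve(self, task: str):
        """Return the stored record whose task is most similar, or None."""
        return max(self.records, key=lambda r: jaccard(task, r.task), default=None)


mem = Memory(add_threshold=0.5, capacity=2)
added_good = mem.add(Record("sort a list of numbers", "use sorted()", 0.9))
added_bad = mem.add(Record("sort a list of names", "bubble sort by hand", 0.2))
mem.add(Record("parse a json file", "use json.load", 0.6))
mem.add(Record("read a csv file", "use csv.reader", 0.8))  # evicts the 0.6 record
hit = mem.retrieve("sort a list of integers")
```

Because the agent tends to follow whatever record is retrieved, rejecting the low-quality record at insertion time is what keeps the erroneous output from being replayed on later, similar tasks.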
Can AI Read Between The Lines? Benchmarking LLMs On Financial Nuance
Abstract
arXiv:2505.16090v1 Announce Type: new Abstract: As of 2025, Generative Artificial Intelligence (GenAI) has become a central tool for productivity across industries. Beyond text generation, GenAI now plays a critical role in coding, data analysis, and research workflows. As large language models (LLMs) continue to evolve, it is essential to assess the reliability and accuracy of their outputs, especially in specialized, high-stakes domains like finance. Most modern LLMs transform text into numerical vectors, which are used in operations such as cosine similarity searches to generate responses. However, this abstraction process can lead to misinterpretation of emotional tone, particularly in nuanced financial contexts. While LLMs generally excel at identifying sentiment in everyday language, these models often struggle with the nuanced, strategically ambiguous language found in earnings call transcripts. Financial disclosures frequently embed sentiment in hedged statements, forward-looking language, and industry-specific jargon, making it difficult even for human analysts to interpret consistently, let alone AI models. This paper presents findings from the Santa Clara Microsoft Practicum Project, led by Professor Charlie Goldenberg, which benchmarks the performance of Microsoft's Copilot, OpenAI's ChatGPT, Google's Gemini, and traditional machine learning models for sentiment analysis of financial text. Using Microsoft earnings call transcripts, the analysis assesses how well LLM-derived sentiment correlates with market sentiment and stock movements and evaluates the accuracy of model outputs. Prompt engineering techniques are also examined to improve sentiment analysis results. Visualizations of sentiment consistency are developed to evaluate alignment between tone and stock performance, with sentiment trends analyzed across Microsoft's lines of business to determine which segments exert the greatest influence.
MAPS: A Multilingual Benchmark for Global Agent Performance and Security
Abstract
arXiv:2505.15935v1 Announce Type: new Abstract: Agentic AI systems, which build on Large Language Models (LLMs) and interact with tools and memory, have rapidly advanced in capability and scope. Yet, since LLMs have been shown to struggle in multilingual settings, typically resulting in lower performance and reduced safety, agentic systems risk inheriting these limitations. This raises concerns about the global accessibility of such systems, as users interacting in languages other than English may encounter unreliable or security-critical agent behavior. Despite growing interest in evaluating agentic AI, existing benchmarks focus exclusively on English, leaving multilingual settings unexplored. To address this gap, we propose MAPS, a multilingual benchmark suite designed to evaluate agentic AI systems across diverse languages and tasks. MAPS builds on four widely used agentic benchmarks - GAIA (real-world tasks), SWE-bench (code generation), MATH (mathematical reasoning), and the Agent Security Benchmark (security). We translate each dataset into ten diverse languages, resulting in 805 unique tasks and 8,855 total language-specific instances. Our benchmark suite enables a systematic analysis of how multilingual contexts affect agent performance and robustness. Empirically, we observe consistent degradation in both performance and security when transitioning from English to other languages, with severity varying by task and correlating with the amount of translated input. Building on these findings, we provide actionable recommendations to guide agentic AI systems development and assessment under multilingual settings. This work establishes a standardized evaluation framework, encouraging future research towards equitable, reliable, and globally accessible agentic AI. MAPS benchmark suite is publicly available at https://huggingface.co/datasets/Fujitsu-FRE/MAPS
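The two reported totals are mutually consistent under one reading of the abstract, checked below. The assumption, which the abstract does not state explicitly, is that each of the 805 tasks is counted once in English and once in each of the ten target languages.

```python
unique_tasks = 805
translated_languages = 10

# Assumption: the English originals are counted alongside the ten translations.
languages_total = translated_languages + 1

instances = unique_tasks * languages_total
```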
LightRouter: Towards Efficient LLM Collaboration with Minimal Overhead
Abstract
arXiv:2505.16221v1 Announce Type: new Abstract: The rapid advancement of large language models has unlocked remarkable capabilities across a diverse array of natural language processing tasks. However, the considerable differences among available LLMs in terms of cost, performance, and computational demands pose significant challenges for users aiming to identify the most suitable model for specific tasks. In this work, we present LightRouter, a novel framework designed to systematically select and integrate a small subset of LLMs from a larger pool, with the objective of jointly optimizing both task performance and cost efficiency. LightRouter leverages an adaptive selection mechanism to identify models that require only a minimal number of boot tokens, thereby reducing costs, and further employs an effective integration strategy to combine their outputs. Extensive experiments across multiple benchmarks demonstrate that LightRouter matches or outperforms widely-used ensemble baselines, achieving up to a 25% improvement in accuracy. Compared with leading high-performing models, LightRouter achieves comparable performance while reducing inference costs by up to 27%. Importantly, our framework operates without any prior knowledge of individual models and relies exclusively on inexpensive, lightweight models. This work introduces a practical approach for efficient LLM selection and provides valuable insights into optimal strategies for model combination.
MAPLE: Many-Shot Adaptive Pseudo-Labeling for In-Context Learning
Abstract
arXiv:2505.16225v1 Announce Type: new Abstract: In-Context Learning (ICL) empowers Large Language Models (LLMs) to tackle diverse tasks by incorporating multiple input-output examples, known as demonstrations, into the input of LLMs. More recently, advancements in the expanded context windows of LLMs have led to many-shot ICL, which uses hundreds of demonstrations and outperforms few-shot ICL, which relies on fewer examples. However, this approach is often hindered by the high cost of obtaining large amounts of labeled data. To address this challenge, we propose Many-Shot Adaptive Pseudo-LabEling, namely MAPLE, a novel influence-based many-shot ICL framework that utilizes pseudo-labeled samples to compensate for the lack of label information. We first identify a subset of impactful unlabeled samples and perform pseudo-labeling on them by querying LLMs. These pseudo-labeled samples are then adaptively selected and tailored to each test query as input to improve the performance of many-shot ICL, without significant labeling costs. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework, showcasing its ability to enhance LLM adaptability and performance with limited labeled data.
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
Abstract
arXiv:2505.16186v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, leading to remarkable improvements in complex tasks. However, they pose great safety risks when faced with harmful queries and adversarial attacks. While supervised fine-tuning (SFT), the recent mainstream safety effort on LRMs, improves safety performance, we find that SFT-aligned models struggle to generalize to unseen jailbreak prompts. After thorough investigation of LRMs' generation, we identify a safety aha moment that can activate safety reasoning and lead to a safe response. This aha moment typically appears in the "key sentence", which follows the model's query understanding process and can indicate whether the model will proceed safely. Based on these insights, we propose SafeKey, including two complementary objectives to better activate the safety aha moment in the key sentence: (1) a Dual-Path Safety Head to enhance the safety signal in the model's internal representations before the key sentence, and (2) a Query-Mask Modeling objective to improve the model's attention to its query understanding, which has important safety hints. Experiments across multiple safety benchmarks demonstrate that our methods significantly improve safety generalization to a wide range of jailbreak attacks and out-of-distribution harmful prompts, lowering the average harmfulness rate by 9.6%, while maintaining general abilities. Our analysis reveals how SafeKey enhances safety by reshaping internal attention and improving the quality of hidden representations.
Losing is for Cherishing: Data Valuation Based on Machine Unlearning and Shapley Value
Abstract
arXiv:2505.16147v1 Announce Type: new Abstract: The proliferation of large models has intensified the need for efficient data valuation methods to quantify the contribution of individual data providers. Traditional approaches, such as game-theory-based Shapley value and influence-function-based techniques, face prohibitive computational costs or require access to full data and model training details, which makes partial data valuation hard to achieve. To address this, we propose Unlearning Shapley, a novel framework that leverages machine unlearning to estimate data values efficiently. By unlearning target data from a pretrained model and measuring performance shifts on a reachable test set, our method computes Shapley values via Monte Carlo sampling, avoiding retraining and eliminating dependence on full data. Crucially, Unlearning Shapley supports both full and partial data valuation, making it scalable for large models (e.g., LLMs) and practical for data markets. Experiments on benchmark datasets and large-scale text corpora demonstrate that our approach matches the accuracy of state-of-the-art methods while reducing computational overhead by orders of magnitude. Further analysis confirms a strong correlation between estimated values and the true impact of data subsets, validating its reliability in real-world scenarios. This work bridges the gap between data valuation theory and practical deployment, offering a scalable, privacy-compliant solution for modern AI ecosystems.
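Monte Carlo estimation of Shapley values can be sketched with the standard permutation-sampling estimator. This is a generic illustration, not the Unlearning Shapley procedure: the `utility` callable stands in for whatever performance measurement is used (in the paper that measurement comes from unlearning, which is not reproduced here), and the additive toy utility and provider names are hypothetical.

```python
import random


def monte_carlo_shapley(players, utility, num_permutations, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled orderings of the players."""
    rng = random.Random(seed)
    values = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev = utility(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = utility(frozenset(coalition))
            values[p] += cur - prev  # marginal contribution of p in this ordering
            prev = cur
    return {p: v / num_permutations for p, v in values.items()}


# Hypothetical additive utility: each data provider adds a fixed amount of accuracy.
weights = {"provider_a": 0.3, "provider_b": 0.1, "provider_c": 0.0}


def toy_utility(coalition):
    return sum(weights[p] for p in coalition)


estimates = monte_carlo_shapley(list(weights), toy_utility, num_permutations=50)
```

With an additive utility every ordering gives the same marginal contribution, so the estimate matches each provider's weight up to floating-point error; for a real, non-additive utility the estimate only converges as the number of sampled permutations grows.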
Dynamic Sampling that Adapts: Iterative DPO for Self-Aware Mathematical Reasoning
Abstract
arXiv:2505.16176v1 Announce Type: new Abstract: In the realm of data selection for reasoning tasks, existing approaches predominantly rely on externally predefined static metrics such as difficulty and diversity, which are often designed for supervised fine-tuning (SFT) and lack adaptability to continuous training processes. A critical limitation of these methods is their inability to dynamically align with the evolving capabilities of models during online training, a gap that becomes increasingly pronounced with the rise of dynamic training paradigms and online reinforcement learning (RL) frameworks (e.g., R1 models). To address this, we introduce SAI-DPO, an algorithm that dynamically selects training data by continuously assessing a model's stage-specific reasoning abilities across different training phases. By integrating real-time model performance feedback, SAI-DPO adapts data selection to the evolving strengths and weaknesses of the model, thus enhancing both data utilization efficiency and final task performance. Extensive experiments on three state-of-the-art models and eight mathematical reasoning benchmarks, including challenging competition-level datasets (e.g., AIME24 and AMC23), demonstrate that SAI-DPO achieves an average performance boost of up to 21.3 percentage points, with particularly notable improvements of 10 and 15 points on AIME24 and AMC23, respectively. These results highlight the superiority of dynamic, model-adaptive data selection over static, externally defined strategies in advancing reasoning.
No Black Boxes: Interpretable and Interactable Predictive Healthcare with Knowledge-Enhanced Agentic Causal Discovery
Abstract
arXiv:2505.16288v1 Announce Type: new Abstract: Deep learning models trained on extensive Electronic Health Records (EHR) data have achieved high accuracy in diagnosis prediction, offering the potential to assist clinicians in decision-making and treatment planning. However, these models lack two crucial features that clinicians highly value: interpretability and interactivity. The "black-box" nature of these models makes it difficult for clinicians to understand the reasoning behind predictions, limiting their ability to make informed decisions. Additionally, the absence of interactive mechanisms prevents clinicians from incorporating their own knowledge and experience into the decision-making process. To address these limitations, we propose II-KEA, a knowledge-enhanced agent-driven causal discovery framework that integrates personalized knowledge databases and agentic LLMs. II-KEA enhances interpretability through explicit reasoning and causal analysis, while also improving interactivity by allowing clinicians to inject their knowledge and experience through customized knowledge bases and prompts. II-KEA is evaluated on both MIMIC-III and MIMIC-IV, demonstrating superior performance along with enhanced interpretability and interactivity, as evidenced by its strong results from extensive case studies.
EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning
Abstract
arXiv:2505.16312v1 Announce Type: new Abstract: Large Language Models (LLMs) excel at complex reasoning through search algorithms, yet current strategies often suffer from massive token consumption due to redundant exploration of semantically equivalent steps. Existing semantic similarity methods struggle to accurately identify such equivalence in domain-specific contexts like mathematical reasoning. To address this, we propose EquivPruner, a simple yet effective approach that identifies and prunes semantically equivalent actions during LLM reasoning search. We also introduce MathEquiv, the first dataset we created for mathematical statement equivalence, which enables the training of a lightweight equivalence detector. Extensive experiments across various models and tasks demonstrate that EquivPruner significantly reduces token consumption, improving searching efficiency and often bolstering reasoning accuracy. For instance, when applied to Qwen2.5-Math-7B-Instruct on GSM8K, EquivPruner reduced token consumption by 48.1% while also improving accuracy. Our code is available at https://github.com/Lolo1222/EquivPruner.
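Pruning equivalent actions at a search node can be sketched as keeping one representative per group of equivalent candidates. This is a minimal illustration, not the EquivPruner implementation: the paper trains a lightweight equivalence detector, whereas `toy_equivalent` below is a hypothetical stand-in that only ignores case and whitespace.

```python
def prune_equivalent(candidates, equivalent):
    """Keep the first candidate from each group of mutually equivalent steps.

    `equivalent(a, b)` is any pairwise predicate; pairwise comparison is used
    because a learned detector does not necessarily yield a hashable key.
    """
    kept = []
    for candidate in candidates:
        if not any(equivalent(candidate, k) for k in kept):
            kept.append(candidate)
    return kept


def toy_equivalent(a: str, b: str) -> bool:
    """Hypothetical stand-in detector: equal after removing case and whitespace."""
    def normalize(s: str) -> str:
        return "".join(s.lower().split())
    return normalize(a) == normalize(b)


# Candidate next steps proposed at one node of a reasoning search tree.
steps = ["x = 3 + 4", "X=3+4", "x = 7 * 2", "x = 3 + 4 "]
pruned = prune_equivalent(steps, toy_equivalent)
```

Each pruned candidate removes an entire subtree that would otherwise be expanded, which is where the token savings in such a search would come from.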
摘要
大语言模型(LLMs)通过搜索算法擅长复杂推理,但现有策略常因对语义等价步骤的冗余探索而导致大量标记消耗。现有语义相似性方法难以在数学推理等特定领域情境中准确识别此类等价性。为此,我们提出EquivPruner——一种简单而有效的方法,可在LLM推理搜索过程中识别并剪枝语义等价动作。我们还创建了首个数学陈述等价数据集MathEquiv,用于训练轻量级等价检测器。跨多种模型与任务的广泛实验表明,EquivPruner能显著降低标记消耗,提升搜索效率并常增强推理准确率。例如,在GSM8K数据集上应用Qwen2.5-Math-7B-Instruct模型时,EquivPruner将标记消耗降低48.1%,同时提高准确率。代码详见https://github.com/Lolo1222/EquivPruner。
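The pruning idea in the EquivPruner abstract can be illustrated with a minimal sketch. It assumes that pruning means keeping one representative per group of equivalent sibling steps; the abstract does not spell out the exact procedure. The names `prune_equivalent` and `toy_equivalent` are hypothetical, and the toy predicate only stands in for the lightweight learned detector the abstract mentions.

```python
from typing import Callable, List


def prune_equivalent(
    candidates: List[str],
    is_equivalent: Callable[[str, str], bool],
) -> List[str]:
    """Keep the first representative of each group of equivalent candidate steps."""
    kept: List[str] = []
    for cand in candidates:
        # Drop the candidate if it matches a step we already decided to expand.
        if not any(is_equivalent(cand, k) for k in kept):
            kept.append(cand)
    return kept


def toy_equivalent(a: str, b: str) -> bool:
    """Hypothetical stand-in for a learned detector: compare normalized text."""
    def normalize(s: str) -> str:
        return "".join(s.lower().split())
    return normalize(a) == normalize(b)


steps = ["x = 2 + 3", "X=2+3", "x = 5 - 1"]
pruned = prune_equivalent(steps, toy_equivalent)
print(pruned)
```

In a search procedure, only the surviving candidates would be expanded further, which is where the token savings described in the abstract would come from.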
How do Scaling Laws Apply to Knowledge Graph Engineering Tasks? The Impact of Model Size on Large Language Model Performance
Abstract
arXiv:2505.16276v1 Announce Type: new Abstract: When using Large Language Models (LLMs) to support Knowledge Graph Engineering (KGE), one of the first indications when searching for an appropriate model is its size. According to the scaling laws, larger models typically show higher capabilities. However, in practice, resource costs are also an important factor and thus it makes sense to consider the ratio between model performance and costs. The LLM-KG-Bench framework enables the comparison of LLMs in the context of KGE tasks and assesses their capabilities of understanding and producing KGs and KG queries. Based on a dataset created in an LLM-KG-Bench run covering 26 open state-of-the-art LLMs, we explore the model size scaling laws specific to KGE tasks. In our analyses, we assess how benchmark scores evolve between different model size categories. Additionally, we inspect how the general score development of single models and families of models correlates to their size. Our analyses revealed that, with a few exceptions, the model size scaling laws generally also apply to the selected KGE tasks. However, in some cases, plateau or ceiling effects occurred, i.e., the task performance did not change much between a model and the next larger model. In these cases, smaller models could be considered to achieve high cost-effectiveness. Regarding models of the same family, sometimes larger models performed worse than smaller models of the same family. These effects occurred only locally. Hence it is advisable to additionally test the next smallest and largest model of the same family.
摘要
当使用大语言模型(LLMs)支持知识图谱工程(KGE)时,搜索合适模型的第一个指标通常是其规模。根据缩放定律,较大模型通常表现出更高能力。然而在实际应用中,资源成本也是重要考量因素,因此需要权衡模型性能与成本之间的比率。LLM-KG-Bench框架能够比较LLMs在KGE任务中的表现,评估其理解和生成知识图谱及图谱查询的能力。基于LLM-KG-Bench运行中创建的涵盖26个开源最先进LLMs的数据集,我们探索了特定于KGE任务的模型规模缩放定律。在分析中,我们评估了不同规模类别模型间基准分数的演变情况,并检验了单个模型及同系列模型的总体得分发展与其规模的相关性。分析表明,除少数例外情况外,模型规模缩放定律通常也适用于所选KGE任务。但在某些情况下会出现平台或天花板效应,即模型与更大模型之间的任务性能变化不大。此类情况下,可考虑采用较小模型以实现高成本效益。对于同系列模型,有时较大模型表现反而逊于较小版本,这些效应仅局部出现。因此建议额外测试同系列中相邻更小和更大的模型。
Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning
Abstract
arXiv:2505.16315v1 Announce Type: new Abstract: Large reasoning models (LRMs) have demonstrated strong performance on complex reasoning tasks, but often suffer from overthinking, generating redundant content regardless of task difficulty. Inspired by the dual process theory in cognitive science, we propose Adaptive Cognition Policy Optimization (ACPO), a reinforcement learning framework that enables LRMs to achieve efficient reasoning through adaptive cognitive allocation and dynamic system switch. ACPO incorporates two key components: (1) introducing system-aware reasoning tokens to explicitly represent the thinking modes thereby making the model's cognitive process transparent, and (2) integrating online difficulty estimation and token length budget to guide adaptive system switch and reasoning during reinforcement learning. To this end, we propose a two-stage training strategy. The first stage begins with supervised fine-tuning to cold start the model, enabling it to generate reasoning paths with explicit thinking modes. In the second stage, we apply ACPO to further enhance adaptive system switch for difficulty-aware reasoning. Experimental results demonstrate that ACPO effectively reduces redundant reasoning while adaptively adjusting cognitive allocation based on task complexity, achieving efficient hybrid reasoning.
摘要
大规模推理模型(LRMs)在复杂推理任务中展现出强大性能,但普遍存在过度思考现象,即无论任务难度如何都会生成冗余内容。受认知科学中双过程理论启发,我们提出自适应认知策略优化(ACPO)——一种通过自适应认知分配与动态系统切换实现高效推理的强化学习框架。ACPO包含两个核心组件:(1)引入系统感知推理标记来显式表征思维模式,从而使模型的认知过程透明化;(2)集成在线难度评估与标记长度预算机制,以指导强化学习过程中的自适应系统切换与推理。为此,我们设计了两阶段训练策略:第一阶段通过监督微调冷启动模型,使其生成具有显式思维模式的推理路径;第二阶段应用ACPO进一步强化面向难度感知推理的自适应系统切换能力。实验结果表明,ACPO能有效减少冗余推理,同时根据任务复杂度自适应调整认知分配,实现高效的混合推理。
Smaller, Smarter, Closer: The Edge of Collaborative Generative AI
Abstract
arXiv:2505.16499v1 Announce Type: new Abstract: The rapid adoption of generative AI (GenAI), particularly Large Language Models (LLMs), has exposed critical limitations of cloud-centric deployments, including latency, cost, and privacy concerns. Meanwhile, Small Language Models (SLMs) are emerging as viable alternatives for resource-constrained edge environments, though they often lack the capabilities of their larger counterparts. This article explores the potential of collaborative inference systems that leverage both edge and cloud resources to address these challenges. By presenting distinct cooperation strategies alongside practical design principles and experimental insights, we offer actionable guidance for deploying GenAI across the computing continuum.
摘要
生成式人工智能(GenAI),尤其是大语言模型(LLMs)的快速普及,暴露出以云为中心部署模式的关键局限性,包括延迟、成本和隐私问题。与此同时,小语言模型(SLMs)正逐渐成为资源受限边缘环境的可行替代方案,但其能力通常不及大型模型。本文探讨了利用边缘与云计算资源的协同推理系统应对这些挑战的潜力。通过提出不同的协作策略,并结合实际设计原则与实验洞察,我们为在整个计算连续体上部署GenAI提供了可操作的指导。
Internal Bias in Reasoning Models leads to Overthinking
Abstract
arXiv:2505.16448v1 Announce Type: new Abstract: While current reasoning models possess strong exploratory capabilities, they are often criticized for overthinking due to redundant and unnecessary reflections. In this work, we reveal for the first time that overthinking in reasoning models may stem from their internal bias towards input texts. Upon encountering a reasoning problem, the model immediately forms a preliminary guess about the answer, which we term an internal bias since it is not derived through actual reasoning. When this guess conflicts with its reasoning result, the model tends to engage in reflection, leading to the waste of computational resources. Through further interpretability experiments, we find that this behavior is largely driven by the model's excessive attention to the input section, which amplifies the influence of internal bias on its decision-making process. Additionally, by masking out the original input section, the effect of internal bias can be effectively alleviated and the reasoning length can be reduced by 31%-53% across different complex reasoning tasks. Notably, in most cases, this approach also leads to improvements in accuracy. These findings demonstrate a causal relationship between internal bias and overthinking.
摘要
当前推理模型虽具备强大的探索能力,却常因冗余且不必要的反思而遭受"过度思考"的诟病。本研究首次揭示推理模型的过度思考可能源于其对输入文本的内部偏见。当面对推理问题时,模型会立即形成对答案的初步猜测——这种未经实际推理产生的预判被我们定义为内部偏见。当该猜测与推理结果冲突时,模型倾向于启动反思机制,导致计算资源浪费。通过可解释性实验发现,该行为主要源于模型对输入段的过度关注,这种关注放大了内部偏见对决策过程的影响。实验表明,通过遮蔽原始输入段可有效缓解内部偏见效应,使不同复杂推理任务中的推理长度减少31%-53%。值得注意的是,在多数情况下该方法还能提升准确率。这些发现证实了内部偏见与过度思考之间存在因果关系。
ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection
Abstract
arXiv:2505.16475v1 Announce Type: new Abstract: We present a novel pipeline, ReflectEvo, to demonstrate that small language models (SLMs) can enhance meta introspection through reflection learning. This process iteratively generates self-reflection for self-training, fostering a continuous and self-evolving process. Leveraging this pipeline, we construct ReflectEvo-460k, a large-scale, comprehensive, self-generated reflection dataset with broadened instructions and diverse multi-domain tasks. Building upon this dataset, we demonstrate the effectiveness of reflection learning to improve SLMs' reasoning abilities using SFT and DPO with remarkable performance, substantially boosting Llama-3 from 52.4% to 71.2% and Mistral from 44.4% to 71.1%. It validates that ReflectEvo can rival or even surpass the reasoning capability of the three prominent open-sourced models on BIG-bench without distillation from superior models or fine-grained human annotation. We further conduct a deeper analysis of the high quality of self-generated reflections and their impact on error localization and correction. Our work highlights the potential of continuously enhancing the reasoning performance of SLMs through iterative reflection learning in the long run.
摘要
我们提出了一种新型流程ReflectEvo,证明小语言模型(SLMs)能通过反思学习增强元自省能力。该流程通过迭代生成自我反思进行自训练,形成持续自我进化的过程。基于此,我们构建了ReflectEvo-460k——一个大规模、综合性、自生成的反思数据集,包含扩展指令和多样化的多领域任务。利用该数据集,我们通过监督微调(SFT)和直接偏好优化(DPO)验证了反思学习对提升SLMs推理能力的显著效果:Llama-3的准确率从52.4%提升至71.2%,Mistral从44.4%提升至71.1%。这表明ReflectEvo无需依赖上级模型蒸馏或精细人工标注,即可媲美甚至超越三大知名开源模型在BIG-bench上的推理能力。我们进一步深入分析了自生成反思的高质量特性及其对错误定位与修正的影响。本研究揭示了通过迭代反思学习持续提升SLMs推理性能的长期潜力。
Advancing the Scientific Method with Large Language Models: From Hypothesis to Discovery
Abstract
arXiv:2505.16477v1 Announce Type: new Abstract: With recent Nobel Prizes recognising AI contributions to science, Large Language Models (LLMs) are transforming scientific research by enhancing productivity and reshaping the scientific method. LLMs are now involved in experimental design, data analysis, and workflows, particularly in chemistry and biology. However, challenges such as hallucinations and reliability persist. In this contribution, we review how Large Language Models (LLMs) are redefining the scientific method and explore their potential applications across different stages of the scientific cycle, from hypothesis testing to discovery. We conclude that, for LLMs to serve as relevant and effective creative engines and productivity enhancers, their deep integration into all steps of the scientific process should be pursued in collaboration and alignment with human scientific goals, with clear evaluation metrics. The transition to AI-driven science raises ethical questions about creativity, oversight, and responsibility. With careful guidance, LLMs could evolve into creative engines, driving transformative breakthroughs across scientific disciplines responsibly and effectively. However, the scientific community must also decide how much it leaves to LLMs to drive science, even when associations with 'reasoning', mostly currently undeserved, are made in exchange for the potential to explore hypothesis and solution regions that might otherwise remain unexplored by human exploration alone.
摘要
随着近年诺贝尔奖对人工智能科学贡献的认可,大语言模型(LLMs)正通过提升生产力和重塑科研方法变革科学研究。当前LLMs已参与化学、生物学等领域的实验设计、数据分析和工作流程,但仍存在幻觉与可靠性等挑战。本文系统评述了大语言模型如何重新定义科学方法,并探讨其在假设检验到科学发现等科研周期各阶段的潜在应用。我们得出结论:要使LLMs成为相关且高效的创意引擎与生产力增强工具,需通过明确评估指标,使其深度融入科研全流程并与人类科学目标协同。向AI驱动科学的转型引发了关于创造性、监督与责任的伦理问题。在审慎引导下,LLMs或可发展为负责任且高效的创意引擎,推动跨学科突破性进展。但科学界仍需权衡:即便在探索人类单独研究可能无法触及的假设与解决方案领域时,当以尚不成熟的"推理"能力为交换条件,究竟应让LLMs在多大程度上主导科研进程。
FREESON: Retriever-Free Retrieval-Augmented Reasoning via Corpus-Traversing MCTS
Abstract
arXiv:2505.16409v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in multi-step reasoning and calling search engines at appropriate steps. However, existing retrieval-augmented reasoning approaches rely on separate retrieval models, limiting the LRM's role in retrieval to deciding when to retrieve and how to query. This separation not only increases hardware and operational costs but also leads to errors in the retrieval process due to the representation bottleneck, a phenomenon where the retriever's embedding space is not expressive enough to meet the generator's requirements. To address this, we shift our perspective from sequence-to-sequence matching to locating the answer-containing paths within the corpus, and propose a novel framework called FREESON (Retriever-FREE Retrieval-Augmented ReaSONing). This framework enables LRMs to retrieve relevant knowledge on their own by acting as both a generator and retriever. To achieve this, we introduce a variant of the MCTS algorithm specialized for the retrieval task, which we call CT-MCTS (Corpus-Traversing Monte Carlo Tree Search). In this algorithm, LRMs traverse through the corpus toward answer-containing regions. Our results on five open-domain QA benchmarks, including single-hop and multi-hop questions, show that FREESON achieves an average improvement of 14.4% in EM and F1 over four multi-step reasoning models with a separate retriever, and it also performs comparably to the strongest baseline, surpassing it by 3% on PopQA and 2WikiMultihopQA.
摘要
大型推理模型(LRMs)在多步推理和适时调用搜索引擎方面展现出卓越能力。然而现有检索增强推理方法依赖独立的检索模型,将LRMs在检索中的作用局限于决定检索时机与查询方式。这种分离不仅增加了硬件与运维成本,更因表征瓶颈现象(即检索器嵌入空间无法充分满足生成器需求)导致检索过程出现误差。为此,我们突破序列到序列匹配的范式,转向在语料库中定位包含答案的路径,提出名为FREESON(无检索器的检索增强推理)的新型框架。该框架通过使LRMs兼具生成器与检索器功能,实现自主检索相关知识。为此,我们提出专用于检索任务的MCTS算法变体——CT-MCTS(语料库遍历蒙特卡洛树搜索),在该算法中LRMs沿语料库向答案所在区域进行遍历。在五个开放域QA基准(含单跳与多跳问题)上的实验表明:相较于四个配备独立检索器的多步推理模型,FREESON在EM和F1指标上平均提升14.4%;其性能与最强基线相当,并在PopQA和2WikiMultihopQA上超出该基线3%。
Edge-First Language Model Inference: Models, Metrics, and Tradeoffs
Abstract
arXiv:2505.16508v1 Announce Type: new Abstract: The widespread adoption of Language Models (LMs) across industries is driving interest in deploying these services across the computing continuum, from the cloud to the network edge. This shift aims to reduce costs, lower latency, and improve reliability and privacy. Small Language Models (SLMs), enabled by advances in model compression, are central to this shift, offering a path to on-device inference on resource-constrained edge platforms. This work examines the interplay between edge and cloud deployments, starting from detailed benchmarking of SLM capabilities on single edge devices, and extending to distributed edge clusters. We identify scenarios where edge inference offers comparable performance with lower costs, and others where cloud fallback becomes essential due to limits in scalability or model capacity. Rather than proposing a one-size-fits-all solution, we present platform-level comparisons and design insights for building efficient, adaptive LM inference systems across heterogeneous environments.
摘要
语言模型(LMs)在各行业的广泛应用推动了人们将其服务部署于从云端到网络边缘的整个计算连续体中的兴趣。这一转变旨在降低成本、减少延迟,并提升可靠性和隐私性。得益于模型压缩技术的进步,小型语言模型(SLMs)成为这一转变的核心,为资源受限的边缘平台提供了设备端推理的途径。本研究探讨了边缘与云端部署之间的相互作用,从单一边缘设备上SLM能力的详细基准测试出发,延伸至分布式边缘集群。我们识别了边缘推理在性能相当且成本更低时的适用场景,以及由于可扩展性或模型容量限制而必须依赖云端回退的其他场景。我们并未提出一刀切的解决方案,而是提供了平台级比较和设计见解,以构建跨异构环境的高效、自适应LM推理系统。
MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks
Abstract
arXiv:2505.16459v1 Announce Type: new Abstract: Recent advances in Multi-Modal Large Language Models (MLLMs) have enabled unified processing of language, vision, and structured inputs, opening the door to complex tasks such as logical deduction, spatial reasoning, and scientific analysis. Despite their promise, the reasoning capabilities of MLLMs, particularly those augmented with intermediate thinking traces (MLLMs-T), remain poorly understood and lack standardized evaluation benchmarks. Existing work focuses primarily on perception or final answer correctness, offering limited insight into how models reason or fail across modalities. To address this gap, we introduce the MMMR, a new benchmark designed to rigorously evaluate multi-modal reasoning with explicit thinking. The MMMR comprises 1) a high-difficulty dataset of 1,083 questions spanning six diverse reasoning types with symbolic depth and multi-hop demands and 2) a modular Reasoning Trace Evaluation Pipeline (RTEP) for assessing reasoning quality beyond accuracy through metrics like relevance, consistency, and structured error annotations. Empirical results show that MLLMs-T overall outperform non-thinking counterparts, but even top models like Claude-3.7-Sonnet and Gemini-2.5 Pro suffer from reasoning pathologies such as inconsistency and overthinking. This benchmark reveals persistent gaps between accuracy and reasoning quality and provides an actionable evaluation pipeline for future model development. Overall, the MMMR offers a scalable foundation for evaluating, comparing, and improving the next generation of multi-modal reasoning systems.
摘要
多模态大语言模型(MLLMs)的最新进展实现了对语言、视觉和结构化输入的统一处理,为逻辑推理、空间推理和科学分析等复杂任务开辟了道路。尽管前景广阔,但MLLMs(特别是增强中间思维轨迹的MLLMs-T)的推理能力仍未被充分理解,且缺乏标准化评估基准。现有研究主要关注感知或最终答案的正确性,对模型跨模态的推理过程或失败原因提供有限洞察。为填补这一空白,我们提出了MMMR基准——一个专门用于严格评估显性思维多模态推理的新基准。该基准包含:1) 一个高难度数据集,涵盖六种具有符号深度和多跳需求的多样化推理类型,共1,083个问题;2) 模块化推理轨迹评估管道(RTEP),通过相关性、一致性和结构化错误标注等指标,超越准确率评估推理质量。实验结果表明,MLLMs-T总体优于非思维增强模型,但即使是Claude-3.7-Sonnet和Gemini-2.5 Pro等顶级模型仍存在不一致性和过度思考等推理缺陷。该基准揭示了准确率与推理质量之间的持续差距,并为未来模型开发提供了可操作的评估框架。总体而言,MMMR为评估、比较和改进下一代多模态推理系统提供了可扩展的基础。
Recursive Offloading for LLM Serving in Multi-tier Networks
Abstract
arXiv:2505.16502v1 Announce Type: new Abstract: Heterogeneous device-edge-cloud computing infrastructures have become widely adopted in telecommunication operators and Wide Area Networks (WANs), offering multi-tier computational support for emerging intelligent services. With the rapid proliferation of Large Language Model (LLM) services, efficiently coordinating inference tasks and reducing communication overhead within these multi-tier network architectures becomes a critical deployment challenge. Existing LLM serving paradigms exhibit significant limitations: on-device deployment supports only lightweight LLMs due to hardware constraints, while cloud-centric deployment suffers from resource congestion and considerable prompt communication overhead caused by frequent service requests during peak periods. Although the model-cascading-based inference strategy adapts better to multi-tier networks, its reliance on fine-grained, manually adjusted thresholds makes it less responsive to dynamic network conditions and varying task complexities. To address these challenges, we propose RecServe, a recursive offloading framework tailored for LLM serving in multi-tier networks. RecServe integrates a task-specific hierarchical confidence evaluation mechanism that guides offloading decisions based on inferred task complexity in progressively scaled LLMs across device, edge, and cloud tiers. To further enable intelligent task routing across tiers, RecServe employs a sliding-window-based dynamic offloading strategy with quantile interpolation, enabling real-time tracking of historical confidence distributions and adaptive offloading threshold adjustments. Experiments on eight datasets demonstrate that RecServe outperforms CasServe in both service quality and communication efficiency, and reduces the communication burden by over 50% compared to centralized cloud-based serving.
摘要
异构设备-边缘-云计算基础设施已在电信运营商和广域网(WAN)中得到广泛应用,为新兴智能服务提供多层次计算支持。随着大语言模型(LLM)服务的快速普及,如何在这种多层网络架构中高效协调推理任务并降低通信开销成为关键部署挑战。现有LLM服务范式存在显著局限:受硬件限制,设备端部署仅支持轻量级LLM;而以云为中心的部署则面临资源拥塞和高峰时段频繁服务请求导致的巨大提示词通信开销。虽然基于模型级联的推理策略更适应多层网络,但其依赖细粒度人工调整阈值的方式难以响应动态网络条件和多变任务复杂度。为此,我们提出RecServe——一个专为多层网络LLM服务设计的递归卸载框架。该框架整合了面向任务的分层置信度评估机制,通过跨设备、边缘和云层级逐步扩展的LLM来推断任务复杂度,从而指导卸载决策。为进一步实现跨层级智能任务路由,RecServe采用基于滑动窗口的分位数插值动态卸载策略,实时追踪历史置信度分布并自适应调整卸载阈值。在八个数据集上的实验表明,RecServe在服务质量和通信效率上均优于CasServe,相比集中式云服务可降低50%以上的通信负担。
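The sliding-window strategy with quantile interpolation described in the RecServe abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the class name `SlidingQuantileThreshold`, the window size, the target quantile, the cold-start threshold, and the rule "offload when confidence falls below the threshold" are all hypothetical choices made for the example.

```python
from collections import deque
from typing import Deque, Sequence


def interpolated_quantile(values: Sequence[float], q: float) -> float:
    """Quantile of `values` with linear interpolation between neighbouring ranks."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)          # fractional rank of the requested quantile
    lo = int(pos)                    # rank just below (pos is non-negative)
    hi = min(lo + 1, len(xs) - 1)    # rank just above, clamped to the last index
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


class SlidingQuantileThreshold:
    """Tracks recent confidence scores and derives an adaptive offloading threshold."""

    def __init__(self, window_size: int = 100, quantile: float = 0.3,
                 initial_threshold: float = 0.5) -> None:
        self.scores: Deque[float] = deque(maxlen=window_size)  # oldest scores drop out
        self.quantile = quantile
        self.initial_threshold = initial_threshold  # assumed cold-start value

    def threshold(self) -> float:
        if not self.scores:
            return self.initial_threshold
        return interpolated_quantile(self.scores, self.quantile)

    def should_offload(self, confidence: float) -> bool:
        """Decide against the current threshold, then record the new score."""
        decision = confidence < self.threshold()
        self.scores.append(confidence)
        return decision


router = SlidingQuantileThreshold(window_size=4, quantile=0.5)
decisions = [router.should_offload(c) for c in (0.2, 0.4, 0.6, 0.8)]
print(decisions, router.threshold())
```

Because the threshold is recomputed from the most recent scores only, it drifts with the observed confidence distribution instead of relying on a manually tuned constant, which is the limitation of model cascading that the abstract points out.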
Is Your LLM-Based Multi-Agent a Reliable Real-World Planner? Exploring Fraud Detection in Travel Planning
Abstract
arXiv:2505.16557v1 Announce Type: new Abstract: The rise of Large Language Model-based Multi-Agent Planning has leveraged advanced frameworks to enable autonomous and collaborative task execution. Some systems rely on platforms like review sites and social media, which are prone to fraudulent information, such as fake reviews or misleading descriptions. This reliance poses risks, potentially causing financial losses and harming user experiences. To evaluate the risk of planning systems in real-world applications, we introduce WandaPlan, an evaluation environment mirroring real-world data and injected with deceptive content. We assess system performance across three fraud cases: Misinformation Fraud, Team-Coordinated Multi-Person Fraud, and Level-Escalating Multi-Round Fraud. We reveal significant weaknesses in existing frameworks that prioritize task efficiency over data authenticity. At the same time, we validate WandaPlan's generalizability, capable of assessing the risks of real-world open-source planning frameworks. To mitigate the risk of fraud, we propose integrating an anti-fraud agent, providing a solution for reliable planning.
摘要
基于大语言模型的多智能体规划系统的兴起,利用先进框架实现了自主协作的任务执行。现有系统多依赖点评网站和社交媒体等易受欺诈信息(如虚假评论或误导性描述)影响的平台,这种依赖性可能引发财务损失和损害用户体验的风险。为评估规划系统在现实应用中的风险,我们提出WandaPlan评估环境,该环境模拟真实数据并注入欺骗性内容。我们通过三类欺诈案例(虚假信息欺诈、团队协作多人欺诈、层级递进多轮欺诈)评估系统性能,发现现有框架因优先考虑任务效率而忽视数据真实性存在重大缺陷。同时验证了WandaPlan的泛化能力,可有效评估现实开源规划框架的风险。为降低欺诈风险,我们提出集成反欺诈智能体的方案,为可靠规划提供解决路径。
Bridging the Dynamic Perception Gap: Training-Free Draft Chain-of-Thought for Dynamic Multimodal Spatial Reasoning
Abstract
arXiv:2505.16579v1 Announce Type: new Abstract: While chains-of-thought (CoT) have advanced complex reasoning in multimodal large language models (MLLMs), existing methods remain confined to text or static visual domains, often faltering in dynamic spatial reasoning tasks. To bridge this gap, we present GRASSLAND, a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Our experiments show that augmenting textual reasoning chains with dynamic visual drafts, overlaid on input images, significantly outperforms conventional approaches, offering new insights into spatial reasoning in evolving environments. To generalize this capability, we propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts into MLLMs. Extensive evaluations demonstrate that D2R consistently enhances performance across diverse tasks, establishing a robust baseline for dynamic spatial reasoning without requiring model fine-tuning. Project is open at https://github.com/Cratileo/D2R.
摘要
尽管思维链(CoT)技术在多模态大语言模型(MLLMs)中推动了复杂推理的发展,但现有方法仍局限于文本或静态视觉领域,在动态空间推理任务中往往表现不佳。为弥补这一不足,我们提出了GRASSLAND——一个专为评估动态空间推理而设计的新型迷宫导航基准测试。实验表明,通过在输入图像上叠加动态视觉草图来增强文本推理链,能显著超越传统方法,为动态环境中的空间推理提供了新见解。为推广这一能力,我们提出D2R(动态草图增强推理),这是一种免训练框架,可将文本CoT与相应视觉草图无缝集成到MLLMs中。大量评估证明,D2R能持续提升各类任务的性能,在不需模型微调的情况下为动态空间推理建立了稳健的基准。项目开源地址:https://github.com/Cratileo/D2R。
SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving
Abstract
arXiv:2505.16646v1 Announce Type: new Abstract: Large Language Models have achieved remarkable results on a variety of mathematical benchmarks. However, concerns remain as to whether these successes reflect genuine mathematical reasoning or superficial pattern recognition. Common evaluation metrics, such as final answer accuracy, fail to disentangle the underlying competencies involved, offering limited diagnostic value. To address these limitations, we introduce SMART: a Self-Generating and Self-Validating Multi-Dimensional Assessment Framework. SMART decomposes mathematical problem solving into four distinct dimensions: understanding, reasoning, arithmetic, and reflection & refinement. Each dimension is evaluated independently through tailored tasks, enabling interpretable and fine-grained analysis of LLM behavior. Crucially, SMART integrates an automated self-generating and self-validating mechanism to produce and verify benchmark data, ensuring both scalability and reliability. We apply SMART to 21 state-of-the-art open- and closed-source LLMs, uncovering significant discrepancies in their abilities across different dimensions. Our findings demonstrate the inadequacy of final answer accuracy as a sole metric and motivate a new holistic metric to better capture true problem-solving capabilities. Code and benchmarks will be released upon acceptance.
摘要
大语言模型在各类数学基准测试中取得了显著成果。然而,这些成功究竟反映真实的数学推理能力还是表面的模式识别,仍存疑虑。现有常用评估指标(如最终答案准确率)无法区分潜在的核心能力要素,诊断价值有限。为此,我们提出SMART框架:一种自生成自验证的多维评估体系。该框架将数学问题解决分解为四个独立维度——理解、推理、算术以及反思与优化,通过定制化任务对各维度进行独立评估,从而实现对大语言模型行为可解释、细粒度的分析。关键创新在于整合了自动化自生成与自验证机制来生产并校验基准数据,确保评估的可扩展性与可靠性。我们对21个最先进的开源与闭源大语言模型进行测试,发现不同维度能力存在显著差异。研究结果证明仅凭最终答案准确率作为单一指标的不足,并推动建立新的综合评价指标以更准确捕捉真实问题解决能力。代码与基准测试数据将在论文录用后公开发布。
ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming
Abstract
arXiv:2505.16667v1 Announce Type: new Abstract: While recent research increasingly emphasizes the value of human-LLM collaboration in competitive programming and proposes numerous empirical methods, a comprehensive understanding remains elusive due to the fragmented nature of existing studies and their use of diverse, application-specific human feedback. Thus, our work serves a three-fold purpose: First, we present the first taxonomy of human feedback consolidating the entire programming process, which promotes fine-grained evaluation. Second, we introduce ELABORATIONSET, a novel programming dataset specifically designed for human-LLM collaboration, meticulously annotated to enable large-scale simulated human feedback and facilitate cost-effective real human interaction studies. Third, we introduce ELABORATION, a novel benchmark to facilitate a thorough assessment of human-LLM competitive programming. With ELABORATION, we pinpoint strengths and weaknesses of existing methods, thereby setting the foundation for future improvement. Our code and dataset are available at https://github.com/SCUNLP/ELABORATION
摘要
尽管近期研究日益强调人类与大型语言模型(LLM)在竞技编程中协作的价值,并提出了多种实证方法,但由于现有研究呈现碎片化特征且采用多样化的应用特定人类反馈,全面理解仍显不足。为此,本研究实现三重目标:首先,我们提出首个整合完整编程流程的人类反馈分类体系,支持细粒度评估。其次,我们推出ELABORATIONSET——一个专为人类-LLM协作设计的新型编程数据集,通过精细标注支持大规模模拟人类反馈,并为经济高效的真人交互研究提供基础。第三,我们建立ELABORATION基准测试,以系统评估人类-LLM竞技编程表现。借助该基准,我们精准识别现有方法的优势与不足,为未来改进奠定基础。代码与数据集详见https://github.com/SCUNLP/ELABORATION。
Data-Driven Breakthroughs and Future Directions in AI Infrastructure: A Comprehensive Review
Abstract
arXiv:2505.16771v1 Announce Type: new Abstract: This paper presents a comprehensive synthesis of major breakthroughs in artificial intelligence (AI) over the past fifteen years, integrating historical, theoretical, and technological perspectives. It identifies key inflection points in AI's evolution by tracing the convergence of computational resources, data access, and algorithmic innovation. The analysis highlights how researchers enabled GPU-based model training, triggered a data-centric shift with ImageNet, simplified architectures through the Transformer, and expanded modeling capabilities with the GPT series. Rather than treating these advances as isolated milestones, the paper frames them as indicators of deeper paradigm shifts. By applying concepts from statistical learning theory such as sample complexity and data efficiency, the paper explains how researchers translated breakthroughs into scalable solutions and why the field must now embrace data-centric approaches. In response to rising privacy concerns and tightening regulations, the paper evaluates emerging solutions like federated learning, privacy-enhancing technologies (PETs), and the data site paradigm, which reframe data access and security. In cases where real-world data remains inaccessible, the paper also assesses the utility and constraints of mock and synthetic data generation. By aligning technical insights with evolving data infrastructure, this study offers strategic guidance for future AI research and policy development.
摘要
本文对过去十五年间人工智能(AI)领域的重大突破进行了全面综合,整合了历史、理论和技术的多维视角。通过追踪计算资源、数据获取与算法创新的融合轨迹,研究界定了AI演进过程中的关键转折点。分析着重阐释了研究者如何实现基于GPU的模型训练、通过ImageNet引发以数据为中心的范式转移、借助Transformer简化架构,以及利用GPT系列拓展建模能力。论文并未将这些进展视为孤立里程碑,而是将其作为深层范式转变的指示标。通过运用统计学习理论中的样本复杂度和数据效率等概念,研究揭示了突破性成果如何转化为可扩展解决方案,并阐明了该领域为何必须转向以数据为中心的方法。针对日益增长的隐私顾虑与监管收紧,论文评估了联邦学习、隐私增强技术(PETs)以及重构数据访问与安全的数据站点范式等新兴解决方案。对于现实数据难以获取的场景,研究还评估了模拟与合成数据生成的效用与限制。通过将技术洞见与演进中的数据基础设施相衔接,本研究为未来AI研究与政策制定提供了战略指引。
MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models
Abstract
arXiv:2505.16700v1 Announce Type: new Abstract: As Large Language Models (LLMs) evolve from passive text generators to active reasoning agents capable of tool interaction, the Model Context Protocol (MCP) has emerged as a standardized framework for dynamic tool discovery and orchestration. Despite widespread industry adoption, existing evaluation methodologies fail to adequately assess tool utilization capabilities within this new paradigm. This paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate LLM performance in the MCP framework through a novel five-dimensional approach measuring: answer accuracy, tool selection efficiency, computational resource efficiency, parameter construction accuracy, and execution speed. Unlike conventional benchmarks that rely on subjective human evaluations or binary success metrics, MCP-RADAR employs objective, quantifiable measurements across multiple task domains including software engineering, mathematical reasoning, and general problem-solving. Our evaluations of leading commercial and open-source LLMs reveal distinctive capability profiles with significant trade-offs between accuracy, efficiency, and speed, challenging traditional single-metric performance rankings. Besides, we provide valuable guidance for developers to optimize their tools for maximum model compatibility and effectiveness. While focused on MCP due to its standardized approach, our methodology remains applicable across all LLM agent tool integration frameworks, providing valuable insights for both LLM developers and tool creators to optimize the entire LLM-tool interaction ecosystem. The implementation, configurations, and datasets used in our evaluation are publicly available at https://anonymous.4open.science/r/MCPRadar-B143.
摘要
随着大型语言模型(LLMs)从被动文本生成器发展为具备工具交互能力的主动推理智能体,模型上下文协议(MCP)已成为动态工具发现与编排的标准化框架。尽管该框架已在工业界广泛应用,现有评估方法仍无法充分衡量这一新范式下的工具利用能力。本文提出首个专为MCP框架设计的综合基准测试MCP-RADAR,通过创新性的五维评估体系进行性能度量:答案准确性、工具选择效率、计算资源效率、参数构建准确性和执行速度。与传统依赖主观人工评估或二元成功指标的基准不同,MCP-RADAR采用客观量化指标,覆盖软件工程、数学推理和通用问题求解等多任务领域。我们对主流商业及开源LLMs的评估揭示了各模型在准确性、效率与速度之间存在显著权衡的独特能力特征,这对传统单一指标性能排名提出了挑战。此外,我们为开发者提供了优化工具以实现最大模型兼容性和有效性的实用指南。虽然研究聚焦于标准化的MCP框架,但该方法论可适用于所有LLM智能体工具集成框架,为LLM开发者和工具创建者优化整体交互生态系统提供了重要参考。评估所用的实现方案、配置及数据集已公开于https://anonymous.4open.science/r/MCPRadar-B143。
KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning
Abstract
arXiv:2505.16826v1 Announce Type: new Abstract: Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models, even without supervised fine-tuning. However, prevalent reinforcement learning algorithms, such as GRPO and variants like DAPO, suffer from a coarse granularity issue when computing the advantage. Specifically, they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions and hindering effective learning. To address this limitation, we propose Key-token Advantage Estimation (KTAE), a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models. KTAE leverages the correctness of sampled rollouts and applies statistical analysis to quantify the importance of individual tokens within a sequence to the final outcome. This quantified token-level importance is then combined with the rollout-level advantage to obtain a more fine-grained token-level advantage estimation. Empirical results show that models trained with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five mathematical reasoning benchmarks. Notably, they achieve higher accuracy with shorter responses and even surpass R1-Distill-Qwen-1.5B using the same base model.
摘要
近期研究表明,将强化学习与基于规则的奖励相结合,即使无需监督微调,也能显著增强大语言模型的推理能力。然而,当前主流强化学习算法如GRPO及其变体DAPO在计算优势值时存在粒度粗放问题。具体而言,这些算法通过序列级优势计算为同一序列中的所有标记分配相同值,无法捕捉标记级贡献,从而阻碍有效学习。为突破这一局限,我们提出关键标记优势估计(KTAE)——一种无需引入额外模型即可实现细粒度标记级优势估计的新算法。KTAE通过统计分析方法,利用采样序列的正确性量化序列中单个标记对最终结果的贡献度,并将该量化结果与序列级优势值结合,获得更精细的标记级优势估计。实验结果表明,采用GRPO+KTAE和DAPO+KTAE训练的模型在五项数学推理基准测试中均超越基线方法。值得注意的是,这些模型能以更短的响应长度实现更高准确率,甚至在使用相同基础模型时超越R1-Distill-Qwen-1.5B。
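The coarse granularity issue raised in the KTAE abstract can be made concrete with a small sketch of a rollout-level advantage. It shows only the baseline behaviour the abstract criticizes (one group-normalized value copied to every token of a rollout), not KTAE's token-level statistics, which the abstract does not specify. The function name, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def rollout_level_advantages(
    rewards: Sequence[float],
    token_counts: Sequence[int],
    eps: float = 1e-8,
) -> List[List[float]]:
    """Group-normalize rewards, then copy each rollout's value to all of its tokens."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the sampled group
    per_rollout = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in a rollout receives the same advantage, whatever its contribution.
    return [[adv] * n for adv, n in zip(per_rollout, token_counts)]


# Four sampled rollouts for one prompt: reward 1.0 if the final answer is correct.
advantages = rollout_level_advantages(
    rewards=[1.0, 0.0, 0.0, 1.0],
    token_counts=[3, 2, 4, 1],
)
for row in advantages:
    print(row)
```

Each printed row holds a single repeated value. A token-level method in the spirit of the abstract would instead reweight the entries within a row according to how much each token matters for the final outcome.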
Beyond Correlation: Towards Causal Large Language Model Agents in Biomedicine
Abstract
arXiv:2505.16982v1 Announce Type: new Abstract: Large Language Models (LLMs) show promise in biomedicine but lack true causal understanding, relying instead on correlations. This paper envisions causal LLM agents that integrate multimodal data (text, images, genomics, etc.) and perform intervention-based reasoning to infer cause-and-effect. Addressing this requires overcoming key challenges: designing safe, controllable agentic frameworks; developing rigorous benchmarks for causal evaluation; integrating heterogeneous data sources; and synergistically combining LLMs with structured knowledge (KGs) and formal causal inference tools. Such agents could unlock transformative opportunities, including accelerating drug discovery through automated hypothesis generation and simulation, enabling personalized medicine through patient-specific causal models. This research agenda aims to foster interdisciplinary efforts, bridging causal concepts and foundation models to develop reliable AI partners for biomedical progress.
摘要
大型语言模型(LLMs)在生物医学领域展现出潜力,但其依赖相关性而非真正的因果理解。本文提出构建因果型LLM智能体的愿景,通过整合多模态数据(文本、图像、基因组学等)并进行基于干预的推理来实现因果关系推断。实现这一目标需攻克以下关键挑战:设计安全可控的智能体框架、建立严格的因果评估基准、融合异构数据源,以及协同结合LLMs与结构化知识图谱(KGs)和形式化因果推理工具。此类智能体有望开启变革性机遇,包括通过自动化假设生成与模拟加速药物发现、基于患者特异性因果模型实现个性化医疗。本研究议程旨在促进跨学科合作,衔接因果概念与基础模型,为生物医学进步开发可靠的人工智能合作伙伴。
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models
Abstract
arXiv:2505.16854v1 Announce Type: new Abstract: Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process-where people skip reasoning for easy questions but think carefully when needed-we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO, without sacrificing performance or even improving it. Further evaluations across diverse vision-language tasks-covering a range of reasoning difficulties under both 3B and 7B models-consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches. Our code is available at https://github.com/kokolerk/TON.
摘要
强化学习(RL)已被证明是一种有效的后训练策略,可增强视觉语言模型(VLM)的推理能力。组相对策略优化(GRPO)是近期的一种重要方法,它鼓励模型在回答前生成完整的推理轨迹,但这会导致标记使用量和计算成本增加。受人类思维过程的启发——人们在简单问题上跳过推理,而在需要时仔细思考——我们探索如何让VLM首先决定何时需要推理。为实现这一目标,我们提出了TON,一种两阶段训练策略:(i)监督微调(SFT)阶段,采用简单而有效的“思维丢弃”操作,随机将推理轨迹替换为空思维。这引入了一种“思考与否”的格式,为选择性推理提供了冷启动;(ii)GRPO阶段,使模型能够自由探索何时思考或不思考,同时最大化任务感知的结果奖励。实验结果表明,与原始GRPO相比,TON可将完成长度减少高达90%,且不会牺牲性能甚至有所提升。在多种视觉语言任务(涵盖3B和7B模型下不同推理难度)的进一步评估中,一致发现模型随着训练的推进逐渐学会跳过不必要的推理步骤。这些发现为强化学习方法中实现类人推理模式提供了启示。我们的代码可在https://github.com/kokolerk/TON获取。
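The "thought dropout" operation in stage (i) can be sketched as below, assuming each SFT sample is a dict with a `thought` field. The field names, the `drop_prob` default, and the seeding are hypothetical; the abstract only states that reasoning traces are randomly replaced with empty thoughts.

```python
import random


def thought_dropout(samples, drop_prob=0.5, seed=0):
    """Return copies of the samples with some reasoning traces emptied.

    samples: list of dicts with hypothetical keys "question", "thought", "answer".
    drop_prob: probability of replacing a trace with an empty thought
               (an illustrative value, not taken from the paper).
    """
    rng = random.Random(seed)
    dropped = []
    for sample in samples:
        thought = "" if rng.random() < drop_prob else sample["thought"]
        # Build a new dict so the original training data is left untouched.
        dropped.append({**sample, "thought": thought})
    return dropped


data = [
    {"question": "2 + 2?", "thought": "Add the two numbers.", "answer": "4"},
    {"question": "Capital of France?", "thought": "Recall geography.", "answer": "Paris"},
]
all_dropped = thought_dropout(data, drop_prob=1.0)
none_dropped = thought_dropout(data, drop_prob=0.0)
```

Training on a mix of full and empty thoughts gives the model examples of both answer formats before the GRPO stage explores when each is worthwhile.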
HyGenar: An LLM-Driven Hybrid Genetic Algorithm for Few-Shot Grammar Generation
Abstract
arXiv:2505.16978v1 Announce Type: new Abstract: Grammar plays a critical role in natural language processing and text/code generation by enabling the definition of syntax, the creation of parsers, and guiding structured outputs. Although large language models (LLMs) demonstrate impressive capabilities across domains, their ability to infer and generate grammars has not yet been thoroughly explored. In this paper, we aim to study and improve the ability of LLMs for few-shot grammar generation, where grammars are inferred from sets of a small number of positive and negative examples and generated in Backus-Naur Form. To explore this, we introduced a novel dataset comprising 540 structured grammar generation challenges, devised 6 metrics, and evaluated 8 various LLMs against it. Our findings reveal that existing LLMs perform sub-optimally in grammar generation. To address this, we propose an LLM-driven hybrid genetic algorithm, namely HyGenar, to optimize grammar generation. HyGenar achieves substantial improvements in both the syntactic and semantic correctness of generated grammars across LLMs.
摘要
语法在自然语言处理和文本/代码生成中具有关键作用,它能够定义句法结构、创建解析器并指导结构化输出。尽管大语言模型(LLMs)在各领域展现出卓越能力,但其推断和生成语法的能力尚未得到深入探索。本文旨在研究并提升LLMs在小样本语法生成中的能力,即从少量正负示例中推断语法并以巴科斯-诺尔范式生成。为此,我们构建了一个包含540项结构化语法生成挑战的新数据集,设计了6项评估指标,并对8种不同LLMs进行了系统测试。研究发现现有LLMs在语法生成任务中表现欠佳。针对此问题,我们提出了一种基于LLM的混合遗传算法HyGenar来优化语法生成。实验表明,HyGenar能显著提升跨模型生成语法在句法和语义层面的正确性。
Know the Ropes: A Heuristic Strategy for LLM-based Multi-Agent System Design
Abstract
arXiv:2505.16979v1 Announce Type: new Abstract: Single-agent LLMs hit hard limits--finite context, role overload, and brittle domain transfer. Conventional multi-agent fixes soften those edges yet expose fresh pains: ill-posed decompositions, fuzzy contracts, and verification overhead that blunts the gains. We therefore present Know-The-Ropes (KtR), a framework that converts domain priors into an algorithmic blueprint hierarchy, in which tasks are recursively split into typed, controller-mediated subtasks, each solved zero-shot or with the lightest viable boost (e.g., chain-of-thought, micro-tune, self-check). Grounded in the No-Free-Lunch theorem, KtR trades the chase for a universal prompt for disciplined decomposition. On the Knapsack problem (3-8 items), three GPT-4o-mini agents raise accuracy from 3% zero-shot to 95% on size-5 instances after patching a single bottleneck agent. On the tougher Task-Assignment problem (6-15 jobs), a six-agent o3-mini blueprint hits 100% up to size 10 and 84% on sizes 13-15, versus 11% zero-shot. Algorithm-aware decomposition plus targeted augmentation thus turns modest models into reliable collaborators--no ever-larger monoliths required.
摘要
单智能体大语言模型面临三大瓶颈:有限上下文容量、角色过载和脆弱的领域迁移能力。传统多智能体方案虽能缓解这些问题,却引入了新痛点:任务分解失当、合约定义模糊以及验证开销抵消性能增益。为此,我们提出Know-The-Ropes(KtR)框架,将领域先验转化为算法蓝图层级结构,通过递归分解为类型化、控制器协调的子任务,采用零样本或最小可行增强策略(如思维链、微调、自检)求解。基于"没有免费午餐"定理,KtR摒弃通用提示词的追求,转向结构化任务分解。在背包问题(3-8物品)中,三个GPT-4o-mini智能体通过修补单个瓶颈节点,将5物品实例的准确率从零样本的3%提升至95%。在更复杂的任务分配问题(6-15作业)中,六智能体o3-mini蓝图在10作业规模实现100%准确率,13-15作业规模达84%,远超零样本11%的表现。算法感知分解结合精准增强,使中等模型即可成为可靠协作体——无需持续堆砌巨型单体模型。
Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning
Abstract
arXiv:2502.15401v1 Announce Type: cross Abstract: In-context learning (ICL) can significantly enhance the complex reasoning capabilities of large language models (LLMs), with the key lying in the selection and ordering of demonstration examples. Previous methods typically relied on simple features to measure the relevance between examples. We argue that these features are not sufficient to reflect the intrinsic connections between examples. In this study, we propose a curriculum ICL strategy guided by problem-solving logic. We select demonstration examples by analyzing the problem-solving logic and order them based on curriculum learning. Specifically, we constructed a problem-solving logic instruction set based on the BREAK dataset and fine-tuned a language model to analyze the problem-solving logic of examples. Subsequently, we selected appropriate demonstration examples based on problem-solving logic and assessed their difficulty according to the number of problem-solving steps. In accordance with the principles of curriculum learning, we ordered the examples from easy to hard to serve as contextual prompts. Experimental results on multiple benchmarks indicate that our method outperforms previous ICL approaches in terms of performance and efficiency, effectively enhancing the complex reasoning capabilities of LLMs. Our project will be publicly available subsequently.
摘要
上下文学习(ICL)能显著增强大语言模型(LLMs)的复杂推理能力,其核心在于演示样例的选择与排序。现有方法通常依赖简单特征衡量样例间相关性,我们认为这些特征不足以反映样例间的内在联系。本研究提出一种基于问题解决逻辑的课程式ICL策略:通过分析问题解决逻辑选择演示样例,并依据课程学习原则进行排序。具体而言,我们基于BREAK数据集构建问题解决逻辑指令集,微调语言模型以解析样例的问题解决逻辑;随后根据逻辑匹配度筛选演示样例,并依据解题步骤数量评估难度。遵循课程学习原理,将样例按从易到难排序作为上下文提示。多基准测试表明,本方法在性能与效率上均优于现有ICL方案,能有效提升LLMs的复杂推理能力。项目代码后续将公开。
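The easy-to-hard ordering step can be sketched as follows, assuming each selected demonstration already carries its list of problem-solving steps. The `steps` field and the sample data are hypothetical; the selection of examples by problem-solving logic is not shown.

```python
def order_by_curriculum(examples):
    """Sort demonstrations from easy to hard by number of solving steps.

    examples: list of dicts with a hypothetical "steps" list per example.
    sorted() is stable, so examples of equal difficulty keep their order.
    """
    return sorted(examples, key=lambda ex: len(ex["steps"]))


demos = [
    {"id": "a", "steps": ["select", "filter", "aggregate"]},
    {"id": "b", "steps": ["select"]},
    {"id": "c", "steps": ["select", "filter"]},
]
ordered_ids = [ex["id"] for ex in order_by_curriculum(demos)]
```

The ordered demonstrations would then be concatenated into the prompt, simplest first.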
X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs
Abstract
arXiv:2505.16997v1 Announce Type: new Abstract: LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of diverse LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields up to 8.4% performance improvement on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS could achieve a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.
摘要
基于大语言模型(LLM)的多智能体系统(MAS)通过多个专业化智能体的协作,扩展了单一LLM的能力。然而,现有大多数MAS框架依赖单一LLM驱动所有智能体,将系统智能限制在该模型的能力范围内。本文探索异构LLM驱动的多智能体系统(X-MAS)范式,其中智能体由多样化LLM驱动,将系统潜力提升至多样化LLM的集体智能水平。我们提出X-MAS-Bench——一个旨在评估不同领域及MAS相关功能中各类LLM性能的综合测试平台。作为一项大规模实证研究,我们在5个领域(涵盖21个测试集)和5种功能上评估了27个LLM,通过超过170万次测试确定各领域-功能组合的最优模型选择。基于这些发现,我们证明从同构转向异构LLM驱动的MAS可显著提升系统性能,而无需结构重构。具体而言,在纯聊天机器人MAS场景中,异构配置使MATH数据集上的性能提升达8.4%;在混合聊天机器人-推理器场景中,异构MAS可在AIME数据集上实现47%的显著性能提升。这些结果揭示了异构LLM在MAS中的变革潜力,为推进可扩展的协作式AI系统指明了一条前景广阔的路径。
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios
Abstract
arXiv:2505.16944v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios. AgentIF features three key characteristics: (1) Realistic, constructed from 50 real-world agentic applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words. (3) Complex, averaging 11.9 constraints per instruction, covering diverse constraint types, such as tool specifications and condition constraints. To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and corresponding evaluation metrics, including code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically evaluate existing advanced LLMs. We observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, providing some findings about the failure modes of existing LLMs. We have released the code and data to facilitate future research.
摘要
大语言模型(LLMs)在现实世界的代理应用中已展现出先进能力。越来越多的研究致力于开发基于LLM的代理以满足实际需求,这带来了一项新挑战:代理场景通常涉及包含复杂约束的长篇指令,例如冗长的系统提示和详细的工具规范。尽管遵循此类指令对代理应用至关重要,但LLM是否能可靠地执行它们仍未得到充分探索。本文提出了AgentIF,这是首个系统评估LLM在代理场景中指令遵循能力的基准。AgentIF具有三个关键特征:(1)真实性,基于50个真实世界代理应用构建;(2)长度,平均1,723词,最长15,630词;(3)复杂性,每条指令平均包含11.9个约束,涵盖工具规范、条件约束等多样类型。为构建AgentIF,我们从工业应用代理和开源代理系统中收集了50个代理任务的707条人工标注指令,并为每条指令标注了相关约束及对应评估指标(包括基于代码的评估、基于LLM的评估以及混合代码-LLM评估)。通过AgentIF对现有先进LLM进行系统评估后,我们发现当前模型整体表现欠佳,尤其在处理复杂约束结构和工具规范时。我们进一步对指令长度和元约束进行了错误分析及实验研究,揭示了现有LLM的若干失效模式。相关代码和数据已开源以促进未来研究。
Transforming Decoder-Only Transformers for Accurate WiFi-Telemetry Based Indoor Localization
Abstract
arXiv:2505.15835v1 Announce Type: cross Abstract: Wireless Fidelity (WiFi) based indoor positioning is a widely researched area for determining the position of devices within a wireless network. Accurate indoor location has numerous applications, such as asset tracking and indoor navigation. Despite advances in WiFi localization techniques -- in particular approaches that leverage WiFi telemetry -- their adoption in practice remains limited due to several factors including environmental changes that cause signal fading, multipath effects, interference, which, in turn, impact positioning accuracy. In addition, telemetry data differs depending on the WiFi device vendor, offering distinct features and formats; use case requirements can also vary widely. Currently, there is no unified model to handle all these variations effectively. In this paper, we present WiFiGPT, a Generative Pretrained Transformer (GPT) based system that is able to handle these variations while achieving high localization accuracy. Our experiments with WiFiGPT demonstrate that GPTs, in particular Large Language Models (LLMs), can effectively capture subtle spatial patterns in noisy wireless telemetry, making them reliable regressors. Compared to existing state-of-the-art methods, our method matches and often surpasses conventional approaches for multiple types of telemetry. Achieving sub-meter accuracy for RSSI and FTM and centimeter-level precision for CSI demonstrates the potential of LLM-based localisation to outperform specialized techniques, all without handcrafted signal processing or calibration.
摘要
基于无线保真(WiFi)的室内定位是无线网络中设备位置确定领域的重要研究方向。精确的室内定位在资产追踪、室内导航等方面具有广泛应用。尽管WiFi定位技术(尤其是利用WiFi遥测的方法)取得了进展,但由于环境变化导致的信号衰减、多径效应、干扰等因素影响定位精度,其实际应用仍受限。此外,不同厂商的WiFi设备提供的遥测数据在特征和格式上存在差异,应用场景需求也各不相同。目前尚缺乏统一模型来有效处理这些变异性。本文提出WiFiGPT系统,该系统基于生成式预训练变换器(GPT),能够在处理这些变异性的同时实现高精度定位。实验表明,GPT(特别是大语言模型LLMs)能有效捕捉噪声无线遥测中的细微空间模式,成为可靠的回归器。与现有先进方法相比,我们的方法在多种遥测类型上达到或超越传统方案:针对RSSI和FTM实现亚米级精度,对CSI达到厘米级精度。这证明基于LLM的定位技术无需人工信号处理或校准即可超越专用技术。
UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Large Language Models
Abstract
arXiv:2505.14679v1 Announce Type: cross Abstract: Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model's internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose ULTRAEDIT-a fundamentally new editing solution that is training-, subject- and memory-free, making it particularly well-suited for ultra-scalable, real-world lifelong model editing. ULTRAEDIT performs editing through a self-contained process that relies solely on lightweight linear algebra operations to compute parameter shifts, enabling fast and consistent parameter modifications with minimal overhead. To improve scalability in lifelong settings, ULTRAEDIT employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. ULTRAEDIT achieves editing speeds over 7x faster than the previous state-of-the-art method-which was also the fastest known approach-while consuming less than 1/3 the VRAM, making it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct ULTRAEDITBENCH-the largest dataset in the field to date, with over 2M editing pairs-and demonstrate that our method supports up to 1M edits while maintaining high accuracy. Comprehensive experiments on four datasets and six models show that ULTRAEDIT consistently achieves superior performance across diverse model editing scenarios. Our code is available at: https://github.com/XiaojieGu/UltraEdit.
摘要
终身学习使大语言模型(LLMs)能够通过持续更新内部知识来适应不断变化的信息。理想的系统应支持高效、广泛的更新,同时保留现有能力并确保可靠部署。模型编辑作为实现这一目标的有前景方案脱颖而出,它提供了一种聚焦且高效的内部知识修订方式。尽管现有范式已取得显著进展,但往往难以满足大规模实际终身适应的需求。为填补这一空白,我们提出ULTRAEDIT——一种全新的编辑解决方案,其无需训练、不受主题限制且无需记忆存储,特别适合超大规模的现实世界终身模型编辑。ULTRAEDIT通过自包含流程执行编辑,仅依赖轻量级线性代数运算计算参数偏移,实现快速一致的参数修改且开销极小。为提升终身场景的可扩展性,ULTRAEDIT采用终身归一化策略持续更新跨轮次的特征统计量,使其能适应分布变化并保持长期一致性。ULTRAEDIT的编辑速度较先前最优方法(也是已知最快方法)提升7倍以上,同时VRAM消耗不足其1/3,成为目前唯一能在24GB消费级GPU上编辑70亿参数大模型的方法。此外,我们构建了该领域迄今最大数据集ULTRAEDITBENCH(含超200万编辑对),并证明本方法支持高达100万次编辑仍保持高精度。在四个数据集和六个模型上的全面实验表明,ULTRAEDIT在多样化模型编辑场景中均保持卓越性能。代码已开源:https://github.com/XiaojieGu/UltraEdit。
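The idea of continuously updating feature statistics across editing turns can be illustrated with a generic running mean and variance (Welford's algorithm). This is a sketch under the assumption that the normalization is a per-dimension standardization; the abstract does not give ULTRAEDIT's actual formula, and the class and method names are hypothetical.

```python
class RunningStats:
    """Per-dimension running mean and variance, updated one vector at a time."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations

    def update(self, features):
        """Fold one new feature vector into the statistics (Welford update)."""
        self.count += 1
        for i, value in enumerate(features):
            delta = value - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (value - self.mean[i])

    def normalize(self, features, eps=1e-8):
        """Standardize a vector using the statistics accumulated so far."""
        normalized = []
        for i, value in enumerate(features):
            variance = self.m2[i] / self.count if self.count > 0 else 0.0
            normalized.append((value - self.mean[i]) / (variance + eps) ** 0.5)
        return normalized


stats = RunningStats(dim=1)
stats.update([1.0])
stats.update([3.0])
z = stats.normalize([3.0])
```

Because the statistics are updated incrementally, no past feature vectors need to be stored, which fits the memory-free framing in the abstract.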
What Lives? A meta-analysis of diverse opinions on the definition of life
Abstract
arXiv:2505.15849v1 Announce Type: cross Abstract: The question of "what is life?" has challenged scientists and philosophers for centuries, producing an array of definitions that reflect both the mystery of its emergence and the diversity of disciplinary perspectives brought to bear on the question. Despite significant progress in our understanding of biological systems, psychology, computation, and information theory, no single definition for life has yet achieved universal acceptance. This challenge becomes increasingly urgent as advances in synthetic biology, artificial intelligence, and astrobiology challenge our traditional conceptions of what it means to be alive. We undertook a methodological approach that leverages large language models (LLMs) to analyze a set of definitions of life provided by a curated set of cross-disciplinary experts. We used a novel pairwise correlation analysis to map the definitions into distinct feature vectors, followed by agglomerative clustering, intra-cluster semantic analysis, and t-SNE projection to reveal underlying conceptual archetypes. This methodology revealed a continuous landscape of the themes relating to the definition of life, suggesting that what has historically been approached as a binary taxonomic problem should be instead conceived as differentiated perspectives within a unified conceptual latent space. We offer a new methodological bridge between reductionist and holistic approaches to fundamental questions in science and philosophy, demonstrating how computational semantic analysis can reveal conceptual patterns across disciplinary boundaries, and opening similar pathways for addressing other contested definitional territories across the sciences.
摘要
“生命是什么?”这一问题几个世纪以来一直挑战着科学家和哲学家,产生了诸多定义,既反映了生命涌现的奥秘,也体现了跨学科视角的多样性。尽管我们在理解生物系统、心理学、计算及信息论方面取得了重大进展,但尚未形成一个被普遍接受的生命定义。随着合成生物学、人工智能和天体生物学的进步不断挑战传统生命概念的边界,这一挑战变得愈发紧迫。我们采用了一种基于大语言模型(LLMs)的方法论,通过分析跨学科专家提供的生命定义集,运用新型成对相关性分析将定义映射为特征向量,继而进行凝聚聚类、簇内语义分析和t-SNE降维投影,以揭示潜在的概念原型。该方法展现出一个连续的生命定义主题图谱,表明这个历史上被视为二元分类学的问题,应被重新理解为统一概念潜在空间中的差异化视角。我们为科学与哲学基础问题的还原论与整体论方法搭建了新的方法论桥梁,证明计算语义分析如何揭示跨学科的概念模式,并为解决科学界其他存在争议的定义领域开辟了类似路径。
AutoData: A Multi-Agent System for Open Web Data Collection
Abstract
arXiv:2505.15859v1 Announce Type: cross Abstract: The exponential growth of data-driven systems and AI technologies has intensified the demand for high-quality web-sourced datasets. While existing datasets have proven valuable, conventional web data collection approaches face significant limitations in terms of human effort and scalability. Current data-collecting solutions fall into two categories: wrapper-based methods that struggle with adaptability and reproducibility, and large language model (LLM)-based approaches that incur substantial computational and financial costs. To address these challenges, we propose AutoData, a novel multi-agent system for Automated web Data collection, that requires minimal human intervention, i.e., only necessitating a natural language instruction specifying the desired dataset. In addition, AutoData is designed with a robust multi-agent architecture, featuring a novel oriented message hypergraph coordinated by a central task manager, to efficiently organize agents across research and development squads. Besides, we introduce a novel hypergraph cache system to advance the multi-agent collaboration process that enables efficient automated data collection and mitigates the token cost issues prevalent in existing LLM-based systems. Moreover, we introduce Instruct2DS, a new benchmark dataset supporting live data collection from web sources across three domains: academic, finance, and sports. Comprehensive evaluations over Instruct2DS and three existing benchmark datasets demonstrate AutoData's superior performance compared to baseline methods. Case studies on challenging tasks such as picture book collection and paper extraction from surveys further validate its applicability. Our source code and dataset are available at https://github.com/GraphResearcher/AutoData.
摘要
数据驱动系统和AI技术的指数级增长加剧了对高质量网络源数据集的需求。尽管现有数据集已证明其价值,但传统网络数据收集方法在人力投入和可扩展性方面存在显著局限。当前数据收集方案分为两类:基于包装器的方法难以适应变化且可复现性差,而基于大语言模型(LLM)的方法则需承担高昂的计算与财务成本。为应对这些挑战,我们提出AutoData——一种新型自动化网络数据收集多智能体系统,仅需自然语言指令指定目标数据集即可运行,极大减少了人工干预。该系统采用鲁棒的多智能体架构,通过中央任务管理器协调的新型定向消息超图,高效组织研发团队中的智能体。此外,我们引入超图缓存系统以优化多智能体协作流程,既能实现高效自动化数据收集,又能缓解现有基于LLM系统的令牌成本问题。同时,我们提出Instruct2DS基准数据集,支持从学术、金融和体育三大领域网络源进行实时数据采集。在Instruct2DS及三个现有基准数据集上的综合评估表明,AutoData性能显著优于基线方法。针对图画书收集和综述文献提取等挑战性任务的案例研究进一步验证了其适用性。源代码与数据集详见https://github.com/GraphResearcher/AutoData。
GRIT: Teaching MLLMs to Think with Images
Abstract
arXiv:2505.15879v1 Announce Type: cross Abstract: Recent studies have demonstrated the efficacy of using Reinforcement Learning (RL) in building reasoning models that articulate chains of thoughts prior to producing final answers. However, despite ongoing advances that aim at enabling reasoning for vision-language tasks, existing open-source visual reasoning models typically generate reasoning content with pure natural language, lacking explicit integration of visual information. This limits their ability to produce clearly articulated and visually grounded reasoning chains. To this end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method for training MLLMs to think with images. GRIT introduces a grounded reasoning paradigm, in which models generate reasoning chains that interleave natural language and explicit bounding box coordinates. These coordinates point to regions of the input image that the model consults during its reasoning process. Additionally, GRIT is equipped with a reinforcement learning approach, GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused on the final answer accuracy and format of the grounded reasoning output, which eliminates the need for data with reasoning chain annotations or explicit bounding box labels. As a result, GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets. Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities.
摘要
近期研究表明,强化学习(RL)在构建推理模型方面具有显著效果,这类模型能在生成最终答案前明确表达思维链。然而,尽管当前研究不断推进视觉-语言任务的推理能力,现有开源视觉推理模型通常仅用纯自然语言生成推理内容,缺乏对视觉信息的显式整合。这导致其难以产生清晰表达且视觉可验证的推理链。为此,我们提出基于图像与文本的 grounded reasoning(GRIT)方法,通过新颖的训练方式使多模态大语言模型(MLLMs)实现图像化思考。GRIT 引入一种 grounded reasoning 范式,要求模型生成交替自然语言与显式边界框坐标的推理链,这些坐标指向模型推理过程中参考的输入图像区域。此外,GRIT 采用基于 GRPO 算法改进的强化学习方法 GRPO-GR,其奖励机制聚焦于最终答案准确性和 grounded reasoning 输出的格式规范,从而无需依赖带有推理链标注或显式边界框标签的数据。这使得 GRIT 具备卓越的数据效率,仅需从现有数据集中获取20个图像-问题-答案三元组即可完成训练。综合评估表明,GRIT 能有效训练 MLLMs 生成连贯且视觉可验证的推理链,成功实现了推理能力与 grounding 能力的统一。
Extracting Probabilistic Knowledge from Large Language Models for Bayesian Network Parameterization
Abstract
arXiv:2505.15918v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated potential as factual knowledge bases; however, their capability to generate probabilistic knowledge about real-world events remains understudied. This paper investigates using probabilistic knowledge inherent in LLMs to derive probability estimates for statements concerning events and their interrelationships captured via a Bayesian Network (BN). Using LLMs in this context allows for the parameterization of BNs, enabling probabilistic modeling within specific domains. Experiments on eighty publicly available Bayesian Networks, from healthcare to finance, demonstrate that querying LLMs about the conditional probabilities of events provides meaningful results when compared to baselines, including random and uniform distributions, as well as approaches based on next-token generation probabilities. We explore how these LLM-derived distributions can serve as expert priors to refine distributions extracted from minimal data, significantly reducing systematic biases. Overall, this work introduces a promising strategy for automatically constructing Bayesian Networks by combining probabilistic knowledge extracted from LLMs with small amounts of real-world data. Additionally, we evaluate several prompting strategies for eliciting probabilistic knowledge from LLMs and establish the first comprehensive baseline for assessing LLM performance in extracting probabilistic knowledge.
摘要
大型语言模型(LLMs)已展现出作为事实性知识库的潜力,但其生成关于现实世界事件的概率性知识的能力仍待深入研究。本文探讨如何利用LLMs内在的概率知识,对通过贝叶斯网络(BN)捕获的事件及其相互关系进行概率估计。在此背景下使用LLMs可实现BN的参数化,从而支持特定领域的概率建模。在涵盖医疗保健至金融等领域的八十个公开贝叶斯网络上进行的实验表明,与随机分布、均匀分布等基线以及基于下一词生成概率的方法相比,通过LLMs查询事件条件概率可获得有意义的结果。我们进一步探究如何将这些LLM导出的分布作为专家先验,以优化从少量数据中提取的分布,显著减少系统性偏差。总体而言,本研究提出了一种通过结合LLMs提取的概率知识与少量真实数据来自动构建贝叶斯网络的有效策略。此外,我们评估了多种用于从LLMs中提取概率知识的提示策略,并建立了首个评估LLMs在概率知识提取性能方面的综合基线。
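One way an LLM-derived distribution could act as an expert prior over sparse data is a Dirichlet-multinomial update, sketched below. The abstract does not say which combination rule the authors use, so the Dirichlet form, the function name, and the `prior_strength` parameter are assumptions for illustration.

```python
def posterior_mean(prior_probs, counts, prior_strength=10.0):
    """Combine an elicited prior with observed counts for one CPT row.

    prior_probs: probabilities elicited from an LLM for each state.
    counts: observed frequency of each state in the small dataset.
    prior_strength: pseudo-count weight given to the prior (hypothetical).
    """
    alphas = [prior_strength * p for p in prior_probs]
    total = sum(alphas) + sum(counts)
    return [(a + c) / total for a, c in zip(alphas, counts)]


# The LLM suggests P(state) = [0.8, 0.2]; four observations say [1, 3].
estimate = posterior_mean([0.8, 0.2], [1, 3], prior_strength=10.0)
```

With few observations the estimate stays close to the prior, and as counts grow the data dominate, which is the usual behavior expected of an expert prior.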
Causal Interventions Reveal Shared Structure Across English Filler-Gap Constructions
Abstract
arXiv:2505.16002v1 Announce Type: cross Abstract: Large Language Models (LLMs) have emerged as powerful sources of evidence for linguists seeking to develop theories of syntax. In this paper, we argue that causal interpretability methods, applied to LLMs, can greatly enhance the value of such evidence by helping us characterize the abstract mechanisms that LLMs learn to use. Our empirical focus is a set of English filler-gap dependency constructions (e.g., questions, relative clauses). Linguistic theories largely agree that these constructions share many properties. Using experiments based in Distributed Interchange Interventions, we show that LLMs converge on similar abstract analyses of these constructions. These analyses also reveal previously overlooked factors -- relating to frequency, filler type, and surrounding context -- that could motivate changes to standard linguistic theory. Overall, these results suggest that mechanistic, internal analyses of LLMs can push linguistic theory forward.
摘要
大型语言模型(LLMs)已成为语言学家构建句法理论时的重要证据来源。本文提出,通过对LLMs应用因果可解释性方法,能够通过揭示模型学习的抽象机制显著提升此类证据的价值。我们以英语填充语-空缺依存结构(如疑问句、关系从句)为实证研究对象。语言学理论普遍认为这些结构具有诸多共性。基于分布式互换干预的实验表明,LLMs对这些结构形成了相似的抽象分析。这些分析同时揭示了频率、填充语类型及上下文环境等被传统理论忽视的影响因素,可能推动标准语言学理论的修正。总体而言,研究结果证明对LLMs进行机制性内部分析能够促进语言学理论的发展。
Pre-training Large Memory Language Models with Internal and External Knowledge
Abstract
arXiv:2505.15962v1 Announce Type: cross Abstract: Neural language models are black-boxes -- both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We propose a new class of language models, Large Memory Language Models (LMLM) with a pre-training recipe that stores factual knowledge in both internal weights and an external database. Our approach strategically masks externally retrieved factual values from the training loss, thereby teaching the model to perform targeted lookups rather than relying on memorization in model weights. Our experiments demonstrate that LMLMs achieve competitive performance compared to significantly larger, knowledge-dense LLMs on standard benchmarks, while offering the advantages of explicit, editable, and verifiable knowledge bases. This work represents a fundamental shift in how language models interact with and manage factual knowledge.
摘要
神经语言模型是黑箱系统——无论是语言模式还是事实知识,都分布在数十亿个不透明的参数中。这种纠缠的编码方式使得可靠地检查、验证或更新特定事实变得困难。我们提出了一类新型语言模型,即具有大记忆的语言模型(LMLM),其预训练方案将事实知识同时存储在内部权重和外部数据库中。我们的方法策略性地屏蔽了训练损失中从外部检索到的事实值,从而教导模型执行定向查询,而非依赖模型权重的记忆。实验表明,与规模更大、知识密集的大型语言模型(LLM)相比,LMLM在标准基准测试中实现了具有竞争力的性能,同时提供了显式、可编辑和可验证的知识库优势。这项工作代表了语言模型与事实知识交互和管理方式的根本性转变。
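Masking externally retrieved factual values from the training loss can be sketched as a negative log-likelihood that skips flagged tokens. How LMLM marks retrieved values is not described in the abstract, so the boolean mask and the function name are hypothetical.

```python
import math


def masked_nll(token_log_probs, is_retrieved_value):
    """Average negative log-likelihood over tokens that are not masked.

    token_log_probs: log-probability the model assigned to each target token.
    is_retrieved_value: True where the token is a fact returned by the
                        external database and is excluded from the loss.
    """
    kept = [
        -log_prob
        for log_prob, masked in zip(token_log_probs, is_retrieved_value)
        if not masked
    ]
    return sum(kept) / len(kept) if kept else 0.0


# The middle token is a retrieved value, so its low probability is ignored.
log_probs = [math.log(0.5), math.log(0.1), math.log(0.5)]
loss = masked_nll(log_probs, [False, True, False])
```

Since masked tokens contribute no gradient, the model is not pushed to memorize those values in its weights and can instead learn to issue the lookup.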
VERDI: VLM-Embedded Reasoning for Autonomous Driving
Abstract
arXiv:2505.15925v1 Announce Type: cross Abstract: While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (a 70B parameter VLM inference at merely 8 tokens per second requires more than 160G of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous Driving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning, without incurring the inference-time costs of large VLMs. We demonstrate the effectiveness of our method on the NuScenes dataset and find that VERDI outperforms existing e2e methods that do not embed reasoning by 10% in ℓ2 distance, while maintaining high inference speed.
摘要
尽管自动驾驶(AD)系统在部分可观测性和现实世界复杂性下的决策面临挑战,人类驾驶员却能够通过常识推理在有限信息下做出近乎最优的决策。近期研究尝试利用微调的视觉语言模型(VLMs)在推理阶段进行轨迹规划以模拟人类行为。尽管这些方法在基准评估中取得了成功,但其部署往往不切实际(一个700亿参数的VLM以每秒仅8个令牌的推理速度需要超过160G内存),且其整体式网络结构阻碍了安全性分解。为弥合这一差距,我们提出用于自动驾驶的VLM嵌入式推理框架(VERDI),该训练时框架将VLMs的推理过程和常识知识蒸馏至AD系统中。VERDI通过将感知、预测和规划阶段的中间模块输出与VLMs生成的驾驶推理过程文本特征对齐,增强了模块化可微分端到端(e2e)AD模型。通过在潜在空间实现对齐,VERDI使模块化AD系统能够内化结构化推理,而无需承担大型VLMs的推理时成本。我们在NuScenes数据集上验证了方法的有效性,发现VERDI在ℓ2距离上优于未嵌入推理的现有e2e方法10%,同时保持较高的推理速度。
Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey
Abstract
arXiv:2505.15957v1 Announce Type: cross Abstract: With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community. We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.
摘要
大型音频-语言模型(LALMs)为大型语言模型(LLMs)增添了听觉能力;随着这类模型的不断发展,人们期望其在各种听觉任务中展现出普适性能力。尽管已有大量基准测试涌现以评估LALMs的性能,但这些评估仍处于碎片化状态且缺乏系统化的分类体系。为填补这一空白,我们开展了全面调研并提出了一套LALM评估的系统分类法,根据其目标将评估划分为四个维度:(1)通用听觉感知与处理能力,(2)知识与推理能力,(3)对话导向能力,以及(4)公平性、安全性与可信度。我们对每个类别进行了详细概述,并指出了该领域面临的挑战,为未来研究方向提供了前瞻性见解。据我们所知,这是首个专门针对LALMs评估的调研工作,为学界提供了清晰的指导框架。我们将公开所调研文献的汇总集合并持续维护,以支持该领域的持续发展。
Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations
Abstract
arXiv:2505.16004v1 Announce Type: cross Abstract: Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, reflecting the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios without notably affecting the outputs of the base LLMs themselves. Overall, our results suggest that SAE concept representations are fragile and may be ill-suited for applications in model monitoring and oversight.
摘要
稀疏自编码器(SAEs)通常用于通过将大型语言模型(LLMs)的内部激活映射到人类可解释的概念表示来解析其工作机制。现有对SAEs的评估主要关注重建-稀疏性权衡、人类(自动)可解释性及特征解耦等指标,却忽视了一个关键维度:概念表示对输入扰动的鲁棒性。我们认为鲁棒性必须作为概念表示的基本考量,因其反映了概念标注的保真度。为此,我们将鲁棒性量化问题建模为输入空间优化问题,并开发了一个包含现实场景的综合评估框架——这些场景中生成的对抗性扰动可操纵SAE的表示。实证研究表明,在大多数情况下,微小的对抗性输入扰动即可有效操纵基于概念的解释,而不会显著影响底层LLM的输出。总体而言,我们的结果表明SAE的概念表示具有脆弱性,可能不适合应用于模型监控与监督场景。
SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models
Abstract
arXiv:2505.16003v1 Announce Type: cross Abstract: The LLM-as-a-Judge paradigm offers a scalable, reference-free approach for evaluating language models. Although several calibration techniques have been proposed to better align these evaluators with human judgment, prior studies focus primarily on narrow, well-structured benchmarks. As a result, it remains unclear whether such calibrations generalize to real-world, open-ended tasks. In this work, we show that SOTA calibrated evaluators often fail in these settings, exhibiting weak or even negative correlation with human judgments. To address this, we propose SLMEval, a novel and efficient calibration method based on entropy maximization over a small amount of human preference data. By estimating a latent distribution over model quality and reweighting evaluator scores accordingly, SLMEval achieves strong correlation with human evaluations across two real-world production use cases and a public benchmark. For example, on one such task, SLMEval achieves a Spearman correlation of 0.57 with human judgments, while G-Eval yields a negative correlation. In addition, SLMEval reduces evaluation costs by 5-30x compared to GPT-4-based calibrated evaluators such as G-Eval.
摘要
LLM-as-a-Judge范式为评估语言模型提供了一种可扩展、无参考的解决方案。尽管已有多种校准技术被提出以更好地使这些评估者与人类判断保持一致,但先前研究主要集中于狭窄、结构化的基准测试。因此,这类校准是否适用于现实世界中开放式的任务仍不明确。本研究显示,当前最先进的校准评估器在此类场景中往往失效,与人类判断呈现弱相关甚至负相关。为此,我们提出SLMEval,这是一种基于熵最大化的新型高效校准方法,仅需少量人类偏好数据。通过估计模型质量的潜在分布并相应调整评估分数权重,SLMEval在两个现实生产用例和公共基准测试中均实现了与人类评估的强相关性。例如,在某项任务中,SLMEval获得0.57的斯皮尔曼相关系数,而G-Eval则呈现负相关。此外,相较于基于GPT-4的校准评估器(如G-Eval),SLMEval将评估成本降低了5至30倍。
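The abstract above reports agreement with human judgments as a Spearman correlation. As a reminder of what that number measures, here is a minimal sketch that computes Spearman's rank correlation from scratch. The score lists are made-up illustration data, not results from the paper, and the function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Illustration only: hypothetical human ratings and evaluator scores.
human = [4, 2, 5, 1, 3]
judge = [3.9, 2.5, 4.1, 2.0, 1.0]
print(round(spearman(human, judge), 3))
```

Because only ranks matter, an evaluator can be miscalibrated in absolute terms and still score well, which is presumably why rank correlation is the reported metric.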
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Abstract
arXiv:2505.15966v1 Announce Type: cross Abstract: Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.
摘要
思维链推理显著提升了大型语言模型(LLMs)在多个领域的性能表现。然而,该推理过程此前仅局限于文本空间,这限制了其在视觉密集型任务中的有效性。为解决这一局限,我们提出了像素空间推理的新概念。在此创新框架下,视觉语言模型(VLMs)被赋予一系列视觉推理操作(如局部放大和帧选择),使其能够直接对视觉证据进行检视、质询与推断,从而提升视觉任务的推理保真度。培养VLMs的像素空间推理能力面临两大挑战:模型初始能力的不均衡性及对新引入像素空间操作的抵触。我们通过两阶段训练方法应对这些挑战:第一阶段采用合成推理轨迹的指令微调,使模型熟悉新型视觉操作;随后通过强化学习(RL)阶段,利用好奇心驱动的奖励机制平衡像素空间推理与文本推理的探索。借助这些视觉操作,VLMs能够与信息密集的图像或视频等复杂视觉输入进行交互,主动收集必要信息。实验表明,该方法在多种视觉推理基准测试中显著提升了VLM性能。我们的7B参数模型在V* bench上达到84%准确率,TallyQA-Complex达74%,InfographicsVQA达84%,创下当前开源模型的最高精度记录。这些结果印证了像素空间推理的重要性及本框架的有效性。
NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning
Abstract
arXiv:2505.16022v1 Announce Type: cross Abstract: Recent advances such as DeepSeek R1-Zero highlight the effectiveness of incentive training, a reinforcement learning paradigm that computes rewards solely based on the final answer part of a language model's output, thereby encouraging the generation of intermediate reasoning steps. However, these methods fundamentally rely on external verifiers, which limits their applicability to domains like mathematics and coding where such verifiers are readily available. Although reward models can serve as verifiers, they require high-quality annotated data and are costly to train. In this work, we propose NOVER, NO-VERifier Reinforcement Learning, a general reinforcement learning framework that requires only standard supervised fine-tuning data with no need for an external verifier. NOVER enables incentive training across a wide range of text-to-text tasks and outperforms the model of the same size distilled from large reasoning models such as DeepSeek R1 671B by 7.7 percent. Moreover, the flexibility of NOVER enables new possibilities for optimizing large language models, such as inverse incentive training.
摘要
DeepSeek R1-Zero等最新进展凸显了激励训练的有效性,这是一种强化学习范式,其奖励仅基于语言模型输出的最终答案部分进行计算,从而鼓励生成中间推理步骤。然而,这些方法从根本上依赖于外部验证器,这使其适用范围局限于数学和编程等验证器易于获取的领域。尽管奖励模型可作为验证器,但它们需要高质量标注数据且训练成本高昂。本研究提出NOVER(无验证器强化学习),这是一种通用强化学习框架,仅需标准监督微调数据而无需外部验证器。NOVER能够在广泛的文本到文本任务中实现激励训练,其性能比从DeepSeek R1 671B等大型推理模型蒸馏出的同规模模型高出7.7%。此外,NOVER的灵活性为优化大语言模型提供了新可能性,例如逆向激励训练。
Merge to Mix: Mixing Datasets via Model Merging
Abstract
arXiv:2505.16066v1 Announce Type: cross Abstract: Mixing datasets for fine-tuning large models (LMs) has become critical for maximizing performance on downstream tasks. However, composing effective dataset mixtures typically relies on heuristics and trial-and-error, often requiring multiple fine-tuning runs to achieve the desired outcome. We propose a novel method, Merge to Mix, that accelerates composing dataset mixtures through model merging. Model merging is a recent technique that combines the abilities of multiple individually fine-tuned LMs into a single LM by using a few simple arithmetic operations. Our key insight is that merging models individually fine-tuned on each dataset in a mixture can effectively serve as a surrogate for a model fine-tuned on the entire mixture. Merge to Mix leverages this insight to accelerate selecting dataset mixtures without requiring full fine-tuning on each candidate mixture. Our experiments demonstrate that Merge to Mix surpasses state-of-the-art methods in dataset selection for fine-tuning LMs.
摘要
混合数据集以微调大模型(LMs)已成为提升下游任务性能的关键方法。然而,构建有效的数据集混合通常依赖于启发式方法和试错过程,往往需要多次微调才能达到预期效果。我们提出了一种新方法:合并混合法(Merge to Mix),通过模型合并加速数据集混合的构建。模型合并是一种新兴技术,通过简单的算术运算将多个单独微调的LMs能力整合到单一模型中。我们的核心发现是:对混合数据集中每个数据集单独微调的模型进行合并,可有效替代对整个混合数据集微调的模型。合并混合法利用这一发现加速数据集选择,无需对每个候选混合进行完整微调。实验表明,在微调LMs的数据集选择任务中,合并混合法优于当前最先进方法。
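The key idea in the abstract above is to merge per-dataset fine-tuned models and use the result as a stand-in for a model trained on the mixture. It can be sketched roughly as follows. This is a toy illustration under stated assumptions: parameters are plain lists of floats, merging is uniform averaging (the abstract only says "a few simple arithmetic operations"), and `score` is a hypothetical validation function supplied by the caller.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, List[float]]  # parameter name -> flat weight vector


def merge_average(models: Sequence[Params]) -> Params:
    """Uniformly average parameters of models with identical shapes.

    Assumption: uniform averaging stands in for the paper's merging rule.
    """
    if not models:
        raise ValueError("need at least one model to merge")
    merged: Params = {}
    for name in models[0]:
        columns = zip(*(m[name] for m in models))
        merged[name] = [sum(col) / len(models) for col in columns]
    return merged


def select_mixture(
    finetuned: Dict[str, Params],
    score: Callable[[Params], float],
) -> Tuple[Tuple[str, ...], float]:
    """Score every non-empty dataset subset via its merged surrogate model.

    `finetuned` maps a dataset name to the model fine-tuned on it alone;
    no additional fine-tuning happens inside this loop.
    """
    best_subset: Tuple[str, ...] = ()
    best_score = float("-inf")
    names = sorted(finetuned)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            surrogate = merge_average([finetuned[n] for n in subset])
            s = score(surrogate)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score


# Toy usage: the "ideal" weights are [1, 1]; score is negative squared error.
toy = {"a": {"w": [0.0, 2.0]}, "b": {"w": [2.0, 0.0]}, "c": {"w": [10.0, 10.0]}}
toy_score = lambda p: -sum((x - 1.0) ** 2 for x in p["w"])
print(select_mixture(toy, toy_score))
```

The cost profile is the point: N fine-tuning runs (one per dataset) are reused across all candidate mixtures, instead of one run per candidate. The exhaustive subset loop is only practical for a handful of datasets.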
Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models
Abstract
arXiv:2505.16056v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To effectively deploy large MoE models on memory-constrained devices, many systems introduce expert offloading that caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this local routing consistency varies across models and remains understudied. In this paper, we propose two metrics to measure local routing consistency of MoE models: (1) Segment Routing Best Performance (SRP), which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) Segment Cache Best Hit Rate (SCH), which measures the optimal segment-level cache hit rate under a given cache size limit. We analyzed 20 MoE LLMs with diverse sizes and architectures and found that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency. We further showed that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models can balance between cache effectiveness and efficiency with cache sizes approximately 2x the active experts. These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed. We publish the code for replicating experiments at https://github.com/ljcleo/moe-lrc .
摘要
混合专家(MoE)技术通过推理过程中稀疏激活专家模块,实现了大语言模型(LLM)的高效扩展。为在内存受限设备上有效部署大型MoE模型,现有系统多采用专家卸载策略——将部分专家缓存于高速内存,其余专家保留在低速内存中通过CPU运行或按需加载。尽管已有研究利用专家激活的局部性(连续token倾向于激活相似专家),但这种局部路由一致性的程度因模型而异且研究不足。本文提出两项指标量化MoE模型的局部路由一致性:(1) 分段路由最优性能(SRP),评估固定专家组覆盖token片段需求的能力;(2) 分段缓存最优命中率(SCH),衡量给定缓存容量限制下的最优分段级缓存命中率。通过对20个不同规模与架构的MoE LLM进行分析,我们发现每层均应用MoE且未使用共享专家的模型表现出最高的局部路由一致性。进一步研究表明:领域专用专家对路由一致性的贡献大于词汇专用专家,且多数模型在缓存容量约为激活专家数2倍时可平衡缓存效率与效果。这些发现为不影响推理速度的内存高效MoE设计与部署提供了理论基础。实验复现代码发布于https://github.com/ljcleo/moe-lrc。
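To make the cache-oriented metric above more concrete, here is a simplified reading of a segment-level best cache hit rate. For each segment of tokens, the best fixed cache of a given size holds the most frequently activated experts, and the hit rate is the share of activations that cache serves. This is an illustrative assumption, not the paper's exact SCH definition; the function names and the routing trace are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Set


def best_segment_hit_rate(segment: Sequence[Set[int]], cache_size: int) -> float:
    """Best hit rate a fixed cache of `cache_size` experts can reach on a segment.

    `segment` holds, per token, the set of expert ids the router activated.
    Keeping the most frequently used experts is optimal for a fixed cache,
    because total hits are just the summed counts of the cached experts.
    """
    counts = Counter(e for token_experts in segment for e in token_experts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(c for _, c in counts.most_common(cache_size))
    return hits / total


def mean_hit_rate(trace: Sequence[Set[int]], segment_len: int, cache_size: int) -> float:
    """Average the per-segment best hit rate over consecutive segments."""
    segments: List[Sequence[Set[int]]] = [
        trace[i:i + segment_len] for i in range(0, len(trace), segment_len)
    ]
    if not segments:
        return 0.0
    rates = [best_segment_hit_rate(s, cache_size) for s in segments]
    return sum(rates) / len(rates)


# Hypothetical routing trace: 6 tokens, top-2 routing over 4 experts.
trace = [{0, 1}, {0, 2}, {0, 1}, {2, 3}, {2, 3}, {1, 3}]
print(mean_hit_rate(trace, segment_len=3, cache_size=2))
```

A model with high local routing consistency reuses the same few experts within a segment, so this number stays high even for small caches.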
Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
Abstract
arXiv:2505.16088v1 Announce Type: cross Abstract: Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 → 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future regimes; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates like historical and futuristic dates. Further, we find that the larger the model, the faster the emergent date abstraction that heals date fragments is accomplished. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, typically differing from human interpretation (year → month → day).
摘要
现代BPE分词器常将日期分割为无意义的片段(如20250312→202、503、12),导致标记数量膨胀并破坏稳健时间推理所需的内在结构。本研究提出:(1)一种简单可解释的度量指标,即日期碎片化比率,用于评估分词器保留多位数日期成分的保真度;(2)发布DateAugBench基准测试集,包含6500个样本,涵盖基于上下文的日期解析、格式无关难题及跨越历史/当代/未来时期的日期运算三大时序推理任务;(3)通过分层探测与因果注意力跳分析,揭示大语言模型通过拼接年月日碎片进行时序推理的涌现式日期抽象机制。实验表明,过度碎片化与历史/未来等非常见日期上最高达10个百分点的准确率下降相关。模型规模越大,其修复日期碎片的涌现抽象能力形成越快。研究还发现,大模型组装日期碎片的推理路径通常与人类的理解方式(年→月→日)不同。
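The fragmentation example from the abstract (20250312 → 202, 503, 12) can be checked mechanically. The sketch below uses a deliberately simple, hypothetical measure: the fraction of date components (year, month, day) that do not coincide exactly with one token. The paper's actual date fragmentation ratio is not defined in the abstract and may differ.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # [start, end) character offsets into the date string


def token_spans(tokens: Sequence[str]) -> List[Span]:
    """Character span of each token, assuming tokens concatenate to the date."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def fragmentation_ratio(tokens: Sequence[str], components: Sequence[Span]) -> float:
    """Fraction of date components that are not preserved as a single token.

    Hypothetical definition for illustration: a component counts as
    preserved only if some token covers exactly its character span.
    """
    spans = set(token_spans(tokens))
    broken = sum(1 for comp in components if comp not in spans)
    return broken / len(components)


# "20250312" laid out as YYYY MM DD.
ymd = [(0, 4), (4, 6), (6, 8)]
print(fragmentation_ratio(["202", "503", "12"], ymd))  # year and month are broken
print(fragmentation_ratio(["2025", "03", "12"], ymd))  # all components preserved
```

In the abstract's example only the day survives as its own token, so two of the three components count as fragmented under this measure.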
Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation
Abstract
arXiv:2505.16146v1 Announce Type: cross Abstract: Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks such as visual question answering (VQA) and image captioning. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world applications. Existing approaches to address this issue focus on incorporating external knowledge bases, alignment training, or decoding strategies, all of which require substantial computational cost and time. Recent works try to explore more efficient alternatives by adjusting LVLMs' internal representations. Although promising, these methods may cause hallucinations to be insufficiently suppressed or lead to excessive interventions that negatively affect normal semantics. In this work, we leverage sparse autoencoders (SAEs) to identify semantic directions closely associated with either hallucinations or actuality, realizing more precise and direct hallucination-related representations. Our analysis demonstrates that interventions along the faithful direction we identified can mitigate hallucinations, while those along the hallucinatory direction can exacerbate them. Building on these insights, we propose Steering LVLMs via SAE Latent Directions (SSL), a training-free method based on SAE-derived latent directions to mitigate hallucinations in LVLMs. Extensive experiments demonstrate that SSL significantly outperforms existing decoding approaches in mitigating hallucinations, while maintaining transferability across different model architectures with negligible additional time overhead.
摘要
大型视觉语言模型(LVLMs)在多模态任务(如视觉问答和图像描述生成)中展现出卓越性能,但仍存在幻觉问题——生成与视觉输入不一致的文本,这对实际应用构成重大风险。现有解决方法主要依赖外部知识库整合、对齐训练或解码策略,这些方法均需高昂计算成本和时间消耗。近期研究尝试通过调整LVLMs内部表征来探索更高效的替代方案,虽然前景可观,但这些方法可能导致幻觉抑制不足或产生过度干预,进而损害正常语义表达。本研究利用稀疏自编码器(SAEs)识别与幻觉或真实性紧密关联的语义方向,实现更精准直接的幻觉相关表征定位。分析表明,沿我们识别的可信方向进行干预可缓解幻觉,而沿幻觉方向干预则会加剧该现象。基于此,我们提出SAE潜在方向引导法(SSL),这是一种基于SAE派生潜在方向的免训练方法,用于抑制LVLMs的幻觉生成。大量实验证明,SSL在减轻幻觉方面显著优于现有解码方法,同时保持跨模型架构的可迁移性,且附加时间开销可忽略不计。
QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design
Abstract
arXiv:2505.16175v1 Announce Type: cross Abstract: Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, in which converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and an overlapping scheme that overlaps CPU video decoding with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.
摘要
长视频理解在视频监控、会议摘要、教学讲座分析和体育赛事转播等实际应用中已成为关键能力。然而,由于两大瓶颈问题,当前视频大语言模型仍面临巨大计算负担:1)顺序视频解码——将原始比特流转换为RGB帧的过程对小时级视频输入可能耗时长达一分钟;2)高达数百万token的昂贵预填充导致LLM推理延迟高且内存占用大。为解决这些挑战,我们提出QuickVideo系统-算法协同设计方案,通过三大核心创新显著加速长视频理解以支持实时下游应用:QuickDecoder采用基于CPU的并行视频解码器,通过将视频分割为关键帧对齐区间并发处理,实现2-3倍加速;QuickPrefill运用KV缓存剪枝的内存高效预填充方法,以更少GPU内存支持更多帧处理;以及重叠调度方案实现CPU视频解码与GPU推理的并行执行。这些组件共同将长视频输入的推理时间缩短一分钟,使有限硬件条件下仍可进行高质量、可扩展的视频理解。实验表明QuickVideo能适应不同时长和采样率,使长视频处理在实践中具备可行性。
NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics
Abstract
arXiv:2505.16210v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. However, LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands, which significantly increases the memory resource consumption of the Key-Value (KV) cache during inference, becoming a major bottleneck in LLM deployment. To address this issue, quantization is a common and straightforward approach. Currently, quantization methods for activations are limited to 8-bit, and quantization to even lower bits can lead to substantial accuracy drops. To further save space by quantizing the KV cache to even lower bits, we analyzed the element distribution of the KV cache and designed the NQKV algorithm. Since the elements within each block of the KV cache follow a normal distribution, NQKV employs per-block quantile quantization to achieve information-theoretically optimal quantization error. Without significantly compromising model output quality, NQKV enables the OPT model to perform inference with a 2x larger batch size or a 4x longer context length, and it improves throughput by 9.3x compared to when the KV cache is not used.
摘要
大型语言模型(LLMs)在广泛任务中展现出卓越性能。然而,LLMs通常需要更大批处理量以提升吞吐率,或更长上下文长度以满足任务需求,这显著增加了推理过程中键值(KV)缓存的内存资源消耗,成为LLM部署的主要瓶颈。为解决该问题,量化是一种常见且直接的方法。当前针对激活值的量化方法仅限于8位,更低比特的量化会导致精度显著下降。为通过将KV缓存量化至更低比特进一步节省空间,我们分析了KV缓存的元素分布并设计NQKV算法。由于KV缓存每个区块内元素服从正态分布,NQKV采用逐区块分位数量化以实现信息论最优量化误差。在不显著影响模型输出质量的前提下,NQKV使OPT模型能以2倍批处理量或4倍上下文长度进行推理,与未使用KV缓存时相比,吞吐率提升达9.3倍。
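Per-block quantile quantization under a normality assumption, as described above, can be sketched as follows. Each block is standardized by its own mean and standard deviation, and each element is mapped to the nearest of 2^bits levels placed at equally spaced quantiles of the standard normal distribution. The codebook construction, block handling, and function names here are illustrative assumptions; the abstract does not give NQKV's exact scheme.

```python
from statistics import NormalDist, fmean, pstdev
from typing import List, Sequence, Tuple


def normal_codebook(bits: int) -> List[float]:
    """2**bits levels at the midpoints of equal-probability normal bins.

    Assumption: mid-bin quantiles; other codebook choices are possible.
    """
    n = 1 << bits
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]


def quantize_block(block: Sequence[float], bits: int) -> Tuple[List[int], float, float]:
    """Return integer codes plus the per-block mean and std needed to decode."""
    mu = fmean(block)
    sigma = pstdev(block) or 1.0  # guard against a constant block
    levels = normal_codebook(bits)
    codes = []
    for x in block:
        z = (x - mu) / sigma
        # Nearest codebook level to the standardized value.
        codes.append(min(range(len(levels)), key=lambda i: abs(levels[i] - z)))
    return codes, mu, sigma


def dequantize_block(codes: Sequence[int], mu: float, sigma: float, bits: int) -> List[float]:
    """Map codes back to approximate values using the block statistics."""
    levels = normal_codebook(bits)
    return [mu + sigma * levels[c] for c in codes]


block = [-1.0, 0.0, 1.0, 2.0]
codes, mu, sigma = quantize_block(block, bits=2)
print(codes, dequantize_block(codes, mu, sigma, bits=2))
```

Placing levels at quantiles means each code is used about equally often when the block really is normally distributed, which is the intuition behind the abstract's optimality claim. The stored overhead per block is just the two statistics.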
Explain Less, Understand More: Jargon Detection via Personalized Parameter-Efficient Fine-tuning
Abstract
arXiv:2505.16227v1 Announce Type: cross Abstract: Personalizing jargon detection and explanation is essential for making technical documents accessible to readers with diverse disciplinary backgrounds. However, tailoring models to individual users typically requires substantial annotation efforts and computational resources due to user-specific finetuning. To address this, we present a systematic study of personalized jargon detection, focusing on methods that are both efficient and scalable for real-world deployment. We explore two personalization strategies: (1) lightweight fine-tuning using Low-Rank Adaptation (LoRA) on open-source models, and (2) personalized prompting, which tailors model behavior at inference time without retraining. To reflect realistic constraints, we also investigate hybrid approaches that combine limited annotated data with unsupervised user background signals. Our personalized LoRA model outperforms GPT-4 by 21.4% in F1 score and exceeds the best performing oracle baseline by 8.3%. Remarkably, our method achieves comparable performance using only 10% of the annotated training data, demonstrating its practicality for resource-constrained settings. Our study is the first to systematically explore efficient, low-resource personalization of jargon detection using open-source language models, offering a practical path toward scalable, user-adaptive NLP systems.
摘要
个性化术语检测与解释对于使技术文档适应不同学科背景的读者至关重要。然而,针对个体用户定制模型通常需要大量标注工作和计算资源,因为涉及用户特定的微调。为此,我们系统研究了个性化术语检测方法,重点关注实际部署中高效且可扩展的方案。我们探索了两种个性化策略:(1)基于开源模型采用低秩自适应(LoRA)的轻量级微调;(2)无需重新训练的个性化提示方法,在推理阶段调整模型行为。为反映现实约束,我们还研究了将有限标注数据与无监督用户背景信号相结合的混合方法。实验表明,我们的个性化LoRA模型F1分数比GPT-4高出21.4%,较表现最佳的oracle基线提升8.3%。值得注意的是,该方法仅需10%标注训练数据即可达到相当性能,证明了其在资源受限场景下的实用性。本研究首次系统探索了基于开源语言模型的高效、低资源个性化术语检测方案,为构建可扩展的用户自适应NLP系统提供了可行路径。
VLM-R: Region Recognition, Reasoning, and Refinement for Enhanced Multimodal Chain-of-Thought
Abstract
arXiv:2505.16192v1 Announce Type: cross Abstract: Recently, reasoning-based MLLMs have achieved a degree of success in generating long-form textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on and revisiting of visual regions to achieve precise grounding of textual reasoning in visual evidence. We introduce VLM-R (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to (i) decide when additional visual evidence is needed, (ii) determine where to ground within the image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved chain-of-thought. The core of our method is Region-Conditioned Reinforcement Policy Optimization (R-GRPO), a training paradigm that rewards the model for selecting informative regions, formulating appropriate transformations (e.g., crop, zoom), and integrating the resulting visual context into subsequent reasoning steps. To bootstrap this policy, we compile a modest but carefully curated Visuo-Lingual Interleaved Rationale (VLIR) corpus that provides step-level supervision on region selection and textual justification. Extensive experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R sets a new state of the art in zero-shot and few-shot settings, with the largest gains appearing on questions demanding subtle spatial reasoning or fine-grained visual cue extraction.
摘要
近年来,基于推理的多模态大语言模型(MLLMs)在生成长篇文本推理链方面取得了一定成功。然而,面对需要动态迭代聚焦并重新审视视觉区域以实现文本推理与视觉证据精准对接的复杂任务时,现有模型仍存在不足。我们提出VLM-R(具备区域识别与推理能力的视觉语言模型),该框架使MLLM能够:(i)判断何时需要补充视觉证据;(ii)确定在图像中的何处进行定位;(iii)将相关子图像内容无缝编织至交错的思维链中。方法的核心是区域条件强化策略优化(R-GRPO),该训练范式通过奖励模型选择信息性区域、制定适当变换(如裁剪、缩放)并将生成的视觉上下文整合至后续推理步骤来实现优化。为引导该策略,我们构建了规模不大但经过精心筛选的视觉语言交错推理依据(VLIR)语料库,提供区域选择与文本论证的步骤级监督。在MathVista、ScienceQA等基准上的大量实验表明,VLM-R在零样本和少样本设置下创造了新的技术标杆,尤其在需要精细空间推理或细粒度视觉线索提取的问题上提升最为显著。
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Abstract
arXiv:2505.16211v1 Announce Type: cross Abstract: The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains scarce. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully designs 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.
摘要
音频大语言模型(ALLMs)的快速发展和广泛应用亟需对其可信度进行严格评估。然而,针对此类模型的系统性研究,尤其是涉及音频模态特有风险的评估仍十分匮乏。现有评估框架主要集中于文本模态或仅涵盖有限的安全维度,未能充分考虑音频模态的独特特性和应用场景。本文提出AudioTrust,即首个专为ALLMs设计的多元化可信度评估框架与基准测试平台,该框架涵盖公平性、幻觉、安全性、隐私性、鲁棒性和身份认证六大核心维度。为实现全面评估,AudioTrust构建了18种实验场景,其核心是一个包含超过4,420个音频/文本样本的数据集,样本取自真实场景(如日常对话、紧急呼叫、语音助手交互),专门用于探究ALLMs的多维可信度。评估方面,本基准精心设计了9项音频专用指标,并采用大规模自动化流程对模型输出进行客观可扩展的评分。实验结果揭示了当前最先进的开源与闭源ALLMs在面对各类高风险音频场景时的可信度边界与局限性,为未来音频模型的安全可信部署提供了重要参考。我们的平台与基准测试已开源:https://github.com/JusperLee/AudioTrust。
DualComp: End-to-End Learning of a Unified Dual-Modality Lossless Compressor
Abstract
arXiv:2505.16256v1 Announce Type: cross Abstract: Most learning-based lossless compressors are designed for a single modality, requiring separate models for multi-modal data and lacking flexibility. However, different modalities vary significantly in format and statistical properties, making it ineffective to use compressors that lack modality-specific adaptations. While multi-modal large language models (MLLMs) offer a potential solution for modality-unified compression, their excessive complexity hinders practical deployment. To address these challenges, we focus on the two most common modalities, image and text, and propose DualComp, the first unified and lightweight learning-based dual-modality lossless compressor. Built on a lightweight backbone, DualComp incorporates three key structural enhancements to handle modality heterogeneity: modality-unified tokenization, modality-switching contextual learning, and modality-routing mixture-of-experts. A reparameterization training strategy is also used to boost compression performance. DualComp integrates both modality-specific and shared parameters for efficient parameter utilization, enabling near real-time inference (200KB/s) on desktop CPUs. With much fewer parameters, DualComp achieves compression performance on par with the SOTA LLM-based methods for both text and image datasets. Its simplified single-modality variant surpasses the previous best image compressor on the Kodak dataset by about 9% using just 1.2% of the model size.
摘要
大多数基于学习的无损压缩方法针对单一模态设计,需为多模态数据建立独立模型且缺乏灵活性。然而不同模态在格式与统计特性上差异显著,缺乏模态适配的压缩器效果欠佳。虽然多模态大语言模型(MLLMs)为模态统一压缩提供了潜在解决方案,但其过高复杂度阻碍了实际部署。为解决这些问题,我们聚焦图像与文本两大常见模态,提出首个统一、轻量化的双模态无损压缩器DualComp。该模型基于轻量级主干网络,通过三项关键结构改进处理模态异质性:模态统一标记化、模态切换上下文学习及模态路由专家混合机制,并采用重参数化训练策略提升压缩性能。DualComp通过模态专用参数与共享参数的高效协同,在桌面CPU上实现近实时推理(200KB/s)。其参数量大幅减少的同时,在文本和图像数据集上的压缩性能与基于LLM的最先进方法相当。其简化单模态变体仅用1.2%的模型尺寸,便在Kodak数据集上以约9%的优势超越此前最佳图像压缩器。
LIFEBench: Evaluating Length Instruction Following in Large Language Models
Abstract
arXiv:2505.16234v1 Announce Type: cross Abstract: While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions, e.g., write a 10,000-word novel. Additionally, models often generate far too short outputs, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generation quality, but often overlook whether the generations meet length constraints. To this end, we introduce Length Instruction Following Evaluation Benchmark (LIFEBench) to comprehensively evaluate LLMs' ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8192 words. We evaluate 26 widely-used LLMs and find that most models reasonably follow short-length instructions but deteriorate sharply beyond a certain threshold. Surprisingly, almost all models fail to reach the vendor-claimed maximum output lengths in practice, as further confirmed by our evaluations extending up to 32K words. Even long-context LLMs, despite their extended input-output windows, counterintuitively fail to improve length-instruction following. Notably, Reasoning LLMs outperform even specialized long-text generation models, achieving state-of-the-art length following. Overall, LIFEBench uncovers fundamental limitations in current LLMs' length-instruction following ability, offering critical insights for future progress.
摘要
尽管大语言模型(LLMs)能够解决涉及长上下文输入的博士级推理问题,但它们在一个看似更简单的任务上却表现不佳:遵循显式长度指令——例如撰写一篇10,000字的小说。此外,模型生成的输出往往过短、提前终止,甚至直接拒绝请求。现有基准主要评估生成质量,但常常忽略生成内容是否满足长度约束。为此,我们引入了长度指令遵循评估基准(LIFEBench),以全面评估LLMs在不同任务和广泛指定长度范围内遵循长度指令的能力。LIFEBench包含10,800个实例,涵盖4个任务类别,支持中英双语,长度约束范围从16到8192字。我们对26个广泛使用的LLMs进行了评估,发现大多数模型能较好地遵循短长度指令,但超过特定阈值后性能急剧下降。令人惊讶的是,几乎所有模型在实际应用中均未能达到厂商宣称的最大输出长度,这一结论在我们扩展至32K字的评估中得到了进一步验证。即使是长上下文LLMs,尽管其输入输出窗口有所扩展,反直觉地未能提升长度指令遵循能力。值得注意的是,推理型LLMs的表现甚至优于专门的长文本生成模型,实现了最先进的长度指令遵循水平。总体而言,LIFEBench揭示了当前LLMs在长度指令遵循能力上的根本局限,为未来进展提供了关键洞见。
Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning
Abstract
arXiv:2505.16270v1 Announce Type: cross Abstract: Large language models are typically adapted to downstream tasks through supervised fine-tuning on domain-specific data. While standard fine-tuning focuses on minimizing generation loss to optimize model parameters, we take a deeper step by retaining and leveraging the model's own learning signals, analogous to how human learners reflect on past mistakes to improve future performance. We first introduce the concept of Mistake Log to systematically track the model's learning behavior and recurring errors throughout fine-tuning. Treating the original transformer-based model as the Pilot, we correspondingly design a Copilot model to refine the Pilot's inference performance via logits rectification. We name the overall Pilot-Copilot framework the Transformer Copilot, which introduces (i) a novel Copilot model design, (ii) a joint training paradigm where the Copilot continuously learns from the evolving Mistake Log alongside the Pilot, and (iii) a fused inference paradigm where the Copilot rectifies the Pilot's logits for enhanced generation. We provide both theoretical and empirical analyses on our new learning framework. Experiments on 12 benchmarks spanning commonsense, arithmetic, and recommendation tasks demonstrate that Transformer Copilot consistently improves performance by up to 34.5%, while introducing marginal computational overhead to Pilot models and exhibiting strong scalability and transferability.
摘要
大语言模型通常通过对领域特定数据进行监督微调来适应下游任务。传统微调方法主要关注最小化生成损失以优化模型参数,而我们更进一步:通过保留并利用模型自身的学习信号,模拟人类通过反思过往错误来提升未来表现的学习机制。首先,我们提出"错误日志"概念,用于系统追踪模型在微调过程中的学习行为与重复性错误。将原始基于Transformer的模型视为"领航模型",相应设计"协航模型"通过logits校正来优化领航模型的推理性能。该整体框架被命名为Transformer协航系统,其创新性体现在:(1)新型协航模型架构;(2)协航模型与领航模型同步训练,持续从动态更新的错误日志中学习的联合训练范式;(3)通过协航模型校正领航模型logits以提升生成质量的融合推理范式。我们对该学习框架进行了理论与实证分析。在涵盖常识推理、算术运算和推荐系统等12个基准测试上的实验表明,Transformer协航系统最高可提升34.5%的性能表现,且仅对领航模型引入边际计算开销,同时展现出优异的可扩展性与迁移能力。
DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
Abstract
arXiv:2505.16278v1 Announce Type: cross Abstract: End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-π₀. Specifically, we add Vision MoE to Drive-π₀ by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from mode averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-π₀.
摘要
端到端自动驾驶(E2E-AD)需要有效处理多视角传感数据,并稳健应对多样复杂的驾驶场景,尤其是激进转弯等罕见操作。混合专家(MoE)架构在大型语言模型(LLM)中的成功表明,参数专业化可实现强大扩展性。本研究提出DriveMoE——一种基于MoE的新型E2E-AD框架,包含场景专业化视觉MoE与技能专业化动作MoE。该框架基于我们原有的具身AI领域Vision-Language-Action(VLA)基线模型Drive-π₀构建。具体而言,我们通过训练动态路由网络根据驾驶上下文选择相关摄像头,为Drive-π₀添加视觉MoE模块。该设计模拟人类驾驶认知机制,即驾驶员选择性关注关键视觉线索而非穷尽处理所有视觉信息。此外,我们通过训练另一路由网络激活不同驾驶行为的专用专家模块,构建动作MoE。通过显式的行为专业化设计,DriveMoE能应对多样化场景,避免现有模型的模式平均问题。Bench2Drive闭环评估实验表明,DriveMoE取得最先进(SOTA)性能,验证了视觉与动作MoE组合在自动驾驶任务中的有效性。我们将公开DriveMoE及Drive-π₀的代码与模型。
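The routing idea in the DriveMoE abstract (a router that activates only a few cameras or experts per driving context) can be illustrated with a generic top-k softmax gate. This is a minimal sketch of standard MoE gating, not the authors' implementation: the function names, the choice of `k`, and the renormalization step are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits: List[float], k: int) -> Dict[int, float]:
    """Keep the k highest-probability experts and renormalize their weights.

    Hypothetical helper: `logits` stands in for the router's scores over
    cameras (Vision MoE) or skill experts (Action MoE).
    """
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}
```

Only the selected indices would be processed downstream, which is what keeps the per-step cost below that of evaluating every expert.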
PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models
Abstract
arXiv:2505.16307v1 Announce Type: cross Abstract: Prompt optimization offers a practical and broadly applicable alternative to fine-tuning for improving large language model (LLM) performance. However, existing methods often rely on costly output generation, self-critiquing abilities, or human-annotated preferences, which limit their scalability, especially for smaller or non-instruction-tuned models. We introduce PMPO (Probabilistic Metric Prompt Optimization), a unified framework that refines prompts using token-level cross-entropy loss as a direct, lightweight evaluation signal. PMPO identifies low-quality prompt segments by masking and measuring their impact on loss, then rewrites and selects improved variants by minimizing loss over positive and negative examples. Unlike prior methods, it requires no output sampling or human evaluation during optimization, relying only on forward passes and log-likelihoods. PMPO supports both supervised and preference-based tasks through a closely aligned loss-based evaluation strategy. Experiments show that PMPO consistently outperforms prior methods across model sizes and tasks: it achieves the highest average accuracy on BBH, performs strongly on GSM8K and AQUA-RAT, and improves AlpacaEval 2.0 win rates by over 19 points. These results highlight PMPO's effectiveness, efficiency, and broad applicability.
摘要
提示优化为提高大语言模型(LLM)性能提供了一种实用且广泛适用的替代方案,相较于微调方法。然而,现有技术通常依赖于高成本的输出生成、自我批判能力或人工标注的偏好数据,这限制了其可扩展性,尤其对于较小或未经指令微调的模型。本文提出概率度量提示优化框架PMPO,该框架通过使用词元级交叉熵损失作为直接、轻量级的评估信号来优化提示。PMPO通过掩码处理识别低质量提示片段并量化其对损失函数的影响,随后通过最小化正负样本的损失值来重写和筛选改进版本。与现有方法不同,该技术优化过程中无需输出采样或人工评估,仅需前向传播和对数似然计算。基于损失函数的紧密对齐评估策略,PMPO可同时支持监督学习和偏好导向任务。实验表明,PMPO在不同模型规模和任务中均优于现有方法:在BBH基准上取得最高平均准确率,在GSM8K和AQUA-RAT任务中表现优异,并将AlpacaEval 2.0胜率提升超过19个百分点。这些结果充分证明了PMPO框架的高效性、有效性及广泛适用性。
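The PMPO abstract describes choosing among prompt variants by token-level cross-entropy rather than by sampled outputs. The sketch below shows only that selection step for the supervised case; the masking and rewriting stages are omitted. `toy_scorer` is a hypothetical stand-in for a real model's forward pass and exists purely so the example runs on its own.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A scorer maps (prompt, question, answer) to per-token log-probabilities
# of the answer. In practice this would be one forward pass of an LM.
Scorer = Callable[[str, str, str], List[float]]


def cross_entropy(token_logprobs: Sequence[float]) -> float:
    """Mean negative log-likelihood over the answer tokens."""
    return -sum(token_logprobs) / len(token_logprobs)


def select_prompt(
    candidates: Sequence[str],
    examples: Sequence[Tuple[str, str]],
    scorer: Scorer,
) -> str:
    """Return the candidate prompt with the lowest mean loss on the examples."""

    def mean_loss(prompt: str) -> float:
        losses = [cross_entropy(scorer(prompt, q, a)) for q, a in examples]
        return sum(losses) / len(losses)

    return min(candidates, key=mean_loss)


def toy_scorer(prompt: str, question: str, answer: str) -> List[float]:
    """Hypothetical scorer: answer tokens seen in the input are 'likely'."""
    seen = set((prompt + " " + question).lower().split())
    return [
        math.log(0.9) if tok.lower() in seen else math.log(0.1)
        for tok in answer.split()
    ]
```

Because the score is a log-likelihood, the same loop works for models too small to critique their own outputs, which is the scalability point the abstract makes.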
AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners
Abstract
arXiv:2505.16322v1 Announce Type: cross Abstract: Self-Taught Reasoners (STaR), synonymously known as Rejection sampling Fine-Tuning (RFT), is an integral part of the training pipeline of self-improving reasoning Language Models (LMs). The self-improving mechanism often employs random observation (data) sampling. However, this leads to an imbalance in which observations are trained on: the model inefficiently over-trains on already-solved examples while under-training on challenging ones. In response, we introduce Adaptive STaR (AdaSTaR), a novel algorithm that rectifies this by integrating two adaptive sampling principles: (1) Adaptive Sampling for Diversity: promoting balanced training across observations, and (2) Adaptive Sampling for Curriculum: dynamically adjusting data difficulty to match the model's evolving strength. Across six benchmarks, AdaSTaR achieves the best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% against an extensive list of baselines. These improvements in performance and efficiency generalize to different pre-trained LMs and larger models, paving the way for more efficient and effective self-improving LMs.
摘要
自教导推理器(STaR),亦称拒绝采样微调(RFT),是自改进推理语言模型(LMs)训练流程的核心组成部分。传统的自改进机制通常采用随机观测(数据)采样,但会导致训练观测不平衡:低效地过度训练已解决的简单样本,而对具有挑战性的样本训练不足。为此,我们提出自适应STaR(AdaSTaR),该创新算法通过整合两项自适应采样原则解决这一问题:(1)多样性自适应采样:促进观测数据的平衡训练;(2)课程自适应采样:动态调整数据难度以匹配模型演化的能力。在六项基准测试中,AdaSTaR在所有案例(6/6)中均取得最佳测试准确率,相较于广泛基线方法平均降低58.6%的训练FLOPs。这些性能与效率的提升可泛化至不同预训练LMs及更大模型,为更高效、更有效的自改进LMs开辟了新路径。
SC4ANM: Identifying Optimal Section Combinations for Automated Novelty Prediction in Academic Papers
Abstract
arXiv:2505.16330v1 Announce Type: cross Abstract: Novelty is a core component of academic papers, and there are multiple perspectives on the assessment of novelty. Existing methods often focus on word or entity combinations, which provide limited insights. The content related to a paper's novelty is typically distributed across different core sections, e.g., Introduction, Methodology and Results. Therefore, exploring the optimal combination of sections for evaluating the novelty of a paper is important for advancing automated novelty assessment. In this paper, we utilize different combinations of sections from academic papers as inputs to drive language models to predict novelty scores. We then analyze the results to determine the optimal section combinations for novelty score prediction. We first employ natural language processing techniques to identify the sectional structure of academic papers, categorizing them into introduction, methods, results, and discussion (IMRaD). Subsequently, we use different combinations of these sections (e.g., introduction and methods) as inputs for pretrained language models (PLMs) and large language models (LLMs), employing novelty scores provided by human expert reviewers as ground truth labels to obtain prediction results. The results indicate that using introduction, results and discussion is most appropriate for assessing the novelty of a paper, while the use of the entire text does not yield significant results. Furthermore, based on the results of the PLMs and LLMs, the introduction and results appear to be the most important sections for the task of novelty score prediction. The code and dataset for this paper can be accessed at https://github.com/njust-winchy/SC4ANM.
摘要
新颖性是学术论文的核心要素,其评估存在多种视角。现有方法多聚焦于词语或实体组合,但提供的见解有限。与论文新颖性相关的内容通常分布于不同核心章节(如引言、方法与结果),因此探索评估论文新颖性的最优章节组合对推进自动化新颖性评估具有重要意义。本文采用学术论文不同章节组合作为输入驱动语言模型预测新颖性评分,通过结果分析确定最优章节组合方案。我们首先运用自然语言处理技术识别论文章节结构(分为引言、方法、结果与讨论的IMRaD结构),随后以不同章节组合(如引言+方法)作为预训练语言模型(PLMs)和大语言模型(LLMs)的输入,以专家评审提供的新颖性评分为真实标签获取预测结果。研究表明,采用引言、结果与讨论三部分的组合最适合评估论文新颖性,而全文使用并未产生显著效果。此外,基于PLMs和LLMs的实验结果表明,引言和结果两个章节在新颖性评分预测任务中最为重要。本文代码与数据集详见https://github.com/njust-winchy/SC4ANM。
AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training
Abstract
arXiv:2505.16363v1 Announce Type: cross Abstract: We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the root of weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. Hence, AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance. Moreover, AdamS is easy to adopt: it can directly inherit hyperparameters of AdamW, and is entirely model-agnostic, integrating seamlessly into existing pipelines without modifications to optimizer APIs or architectures. The motivation behind AdamS stems from the observed smoothness properties in transformer objectives, where local smoothness is governed by gradient magnitudes that can be further approximated by momentum magnitudes. We establish rigorous theoretical convergence guarantees and provide practical guidelines for hyperparameter selection. Empirically, AdamS demonstrates strong performance in various tasks, including pre-training runs on GPT-2 and Llama2 (up to 13B parameters) and reinforcement learning in post-training regimes. With its efficiency, simplicity, and theoretical grounding, AdamS stands as a compelling alternative to existing optimizers.
摘要
我们提出AdamS——一种简单而有效的优化器替代方案,适用于大规模语言模型(LLM)的预训练与后训练场景。该方法通过采用新颖的分母项(即动量与当前梯度平方加权和的平方根),消除了对二阶矩估计的需求。因此AdamS具有高效特性,在保持与带动量随机梯度下降(SGD)相同内存和计算开销的同时,提供了更优的优化性能。该方案具备即插即用特性:可直接继承AdamW的超参数设置,且完全与模型无关,无需修改优化器API或架构即可无缝集成至现有流程。AdamS的设计动机源于Transformer目标函数中观测到的平滑特性,其中局部平滑度由梯度幅值决定,而该幅值可进一步通过动量幅值近似。我们建立了严格的理论收敛保证,并提供了超参数选择的实践指南。实验表明,AdamS在多项任务中表现优异,包括GPT-2和Llama2(最高130亿参数)的预训练,以及后训练阶段的强化学习。凭借其高效性、简洁性和理论完备性,AdamS成为现有优化器的有力替代方案。
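The AdamS abstract specifies the denominator only in words: the root of a weighted sum of squares of the momentum and the current gradient. A minimal sketch of one update step under that description follows. The exact weighting (`beta2` on the previous momentum, `1 - beta2` on the gradient), the use of the previous rather than the updated momentum, and the default hyperparameters are assumptions, not details confirmed by the abstract.

```python
import math
from typing import List, Tuple


def adams_step(
    params: List[float],
    grads: List[float],
    momentum: List[float],
    lr: float = 1e-3,
    beta1: float = 0.9,
    beta2: float = 0.95,
    eps: float = 1e-8,
) -> Tuple[List[float], List[float]]:
    """One AdamS-style update on flat parameter lists (illustrative only).

    The only optimizer state is `momentum`: there is no second-moment
    buffer, which is the memory saving the abstract highlights.
    """
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Assumed form of the normalizer: weighted sum of m^2 and g^2.
        denom = math.sqrt(beta2 * m * m + (1.0 - beta2) * g * g) + eps
        m_new = beta1 * m + (1.0 - beta1) * g
        new_params.append(p - lr * m_new / denom)
        new_momentum.append(m_new)
    return new_params, new_momentum
```

As a usage example, repeatedly calling `adams_step` with the gradient `2 * x` of `f(x) = x**2` moves `x` toward zero while carrying one state value per parameter, the same footprint as SGD with momentum.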
Resource for Error Analysis in Text Simplification: New Taxonomy and Test Collection
Abstract
arXiv:2505.16392v1 Announce Type: cross Abstract: The general public often encounters complex texts but does not have the time or expertise to fully understand them, leading to the spread of misinformation. Automatic Text Simplification (ATS) helps make information more accessible, but its evaluation methods have not kept up with advances in text generation, especially with Large Language Models (LLMs). In particular, recent studies have shown that current ATS metrics do not correlate with the presence of errors. Manual inspections have further revealed a variety of errors, underscoring the need for a more nuanced evaluation framework, which is currently lacking. This resource paper addresses this gap by introducing a test collection for detecting and classifying errors in simplified texts. First, we propose a taxonomy of errors, with a formal focus on information distortion. Next, we introduce a parallel dataset of automatically simplified scientific texts. This dataset has been human-annotated with labels based on our proposed taxonomy. Finally, we analyze the quality of the dataset, and we study the performance of existing models to detect and classify errors from that taxonomy. These contributions give researchers the tools to better evaluate errors in ATS, develop more reliable models, and ultimately improve the quality of automatically simplified texts.
摘要
公众常接触复杂文本却因时间或专业限制难以充分理解,导致错误信息传播。自动文本简化(ATS)技术虽能提升信息可及性,但其评估方法未能跟上文本生成技术的进步,尤其在大语言模型(LLMs)时代更为凸显。最新研究表明,现有ATS评估指标与错误出现率缺乏相关性。人工检查进一步揭示了多样化的错误类型,这凸显出现有评估框架缺乏对错误的精细分类能力。本资源论文通过构建简化文本错误检测与分类测试集来填补这一空白。首先,我们提出以信息失真为核心的形式化错误分类体系;其次,引入经自动简化的平行科学文本数据集,该数据集已基于我们的分类体系进行人工标注;最后,我们分析数据集质量,并评估现有模型在该分类体系下的错误检测与分类性能。这些成果为研究者提供了更完善的ATS错误评估工具,有助于开发更可靠的模型,最终提升自动简化文本的质量。
Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
Abstract
arXiv:2505.16415v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments, context attribution, remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs in different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models.
摘要
检索增强生成(RAG)通过将大语言模型(LLMs)与外部上下文结合,提升了生成响应的准确性与可靠性。然而,由于现有方法计算成本高昂(通常需要大量微调或人工标注),如何可靠地将生成内容归因于特定上下文片段(即上下文归因)仍具挑战性。本研究提出了一种基于Jensen-Shannon散度的新型上下文归因方法(ARC-JSD),无需额外微调或代理建模即可高效精准地识别关键上下文句子。通过在TyDi QA、Hotpot QA和Musique等多种RAG基准测试中使用不同规模的指令调优LLMs进行评估,本方法相较于先前基于代理的方法展现出更高的准确性及显著的计算效率提升。此外,机制分析揭示了负责上下文归因的特定注意力头和多层感知机(MLP)层,为理解RAG模型的内部工作机制提供了重要见解。
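The attribution idea in the ARC-JSD abstract can be sketched as follows: compare the model's response distribution given the full context with the distribution obtained when one context sentence is removed, and rank sentences by the Jensen-Shannon divergence between the two. The sketch assumes a single probability vector per condition for simplicity; how the distributions are obtained from an LLM and how they are aggregated over response tokens are left out, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    mix = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)


def rank_sentences(
    full_dist: Sequence[float],
    ablated_dists: Dict[str, Sequence[float]],
) -> List[str]:
    """Order context sentences by how much removing each one shifts the output."""
    scores = {sid: js_divergence(full_dist, d) for sid, d in ablated_dists.items()}
    return sorted(scores, key=lambda sid: scores[sid], reverse=True)
```

A sentence whose removal barely changes the distribution scores near zero, so only forward passes are needed, with no fine-tuning or surrogate model.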
SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning
Abstract
arXiv:2505.16368v1 Announce Type: cross Abstract: How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer from three key limitations: (1) Scalability. They rely heavily on human annotation or expensive LLM synthesis to generate sufficient training data. (2) Verifiability. LLMs' outputs are hard to verify automatically and reliably. (3) Controllable Difficulty. Most tasks lack fine-grained difficulty control, making it hard to train LLMs to develop reasoning ability from easy to hard. To address these limitations, we propose Saturn, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLM reasoning. Saturn enables scalable task construction, rule-based verification, and precise difficulty control. Saturn designs a curriculum learning pipeline that continuously improves LLMs' reasoning capability by constructing SAT tasks of increasing difficulty and training LLMs from easy to hard. To ensure stable training, we design a principled mechanism to control difficulty transitions. We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying difficulty. It supports the evaluation of how LLM reasoning changes with problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in constructing RL tasks, Saturn achieves further improvements of +8.8%. We release the source code, data, and models to support future research.
摘要
如何设计能有效释放大语言模型(LLM)推理能力的强化学习(RL)任务仍是一个开放性问题。现有RL任务(如数学、编程和构建推理任务)存在三个关键缺陷:(1)可扩展性。它们严重依赖人工标注或昂贵的LLM合成来生成足够训练数据。(2)可验证性。LLM的输出难以自动可靠地验证。(3)难度可控性。大多数任务缺乏细粒度难度控制,难以实现LLM从易到难的推理能力培养。
为此,我们提出Saturn——基于布尔可满足性问题(SAT)的RL框架,通过SAT问题训练和评估LLM推理。Saturn支持可扩展的任务构建、基于规则的验证和精确难度控制。该框架设计了课程学习流程,通过构建难度递增的SAT任务,实现LLM从易到难的推理能力持续提升。为确保训练稳定性,我们设计了控制难度迁移的原则性机制。
我们发布Saturn-2.6k数据集,包含2,660个不同难度的SAT问题,支持评估LLM推理能力随问题难度的变化规律。将Saturn应用于DeepSeek-R1-Distill-Qwen后,我们获得Saturn-1.5B和Saturn-7B模型,取得以下成果:(1)在SAT问题上,二者pass@3指标分别平均提升+14.0和+28.1;(2)在数学和编程任务中,于AIME、LiveCodeBench等基准测试平均分分别提升+4.9和+1.8;(3)相比当前最先进的RL任务构建方法,Saturn实现额外+8.8%的提升。我们公开源代码、数据及模型以支持后续研究。
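Two properties the Saturn abstract relies on, rule-based verification and scalable task construction, are easy to see in code. The sketch below checks a candidate assignment against a CNF formula and generates random k-SAT instances. The DIMACS-style integer literals, the generator, and its size parameters are illustrative assumptions; the abstract does not specify how Saturn parameterizes difficulty.

```python
import random
from typing import Dict, List, Tuple

Clause = Tuple[int, ...]  # DIMACS-style literals: 3 means x3, -3 means NOT x3


def satisfies(clauses: List[Clause], assignment: Dict[int, bool]) -> bool:
    """Rule-based verifier: every clause needs at least one true literal."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def random_k_sat(n_vars: int, n_clauses: int, k: int, seed: int) -> List[Clause]:
    """Sample a random k-SAT instance; larger sizes give harder problems on average."""
    rng = random.Random(seed)
    clauses: List[Clause] = []
    for _ in range(n_clauses):
        variables = rng.sample(range(1, n_vars + 1), k)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
    return clauses
```

Because `satisfies` is exact and cheap, it can serve as a binary reward for any model-proposed assignment without human labels.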
Sparse Activation Editing for Reliable Instruction Following in Narratives
Abstract
arXiv:2505.16505v1 Announce Type: cross Abstract: Complex narrative contexts often challenge language models' ability to follow instructions, and existing benchmarks fail to capture these difficulties. To address this, we propose Concise-SAE, a training-free framework that improves instruction following by identifying and editing instruction-relevant neurons using only natural language instructions, without requiring labelled data. To thoroughly evaluate our method, we introduce FreeInstruct, a diverse and realistic benchmark of 1,212 examples that highlights the challenges of instruction following in narrative-rich settings. While initially motivated by complex narratives, Concise-SAE demonstrates state-of-the-art instruction adherence across varied tasks without compromising generation quality.
摘要
复杂叙事语境常常挑战语言模型遵循指令的能力,而现有基准测试未能捕捉这些困难。为此,我们提出Concise-SAE框架——一种无需训练的方法,仅通过自然语言指令即可识别并编辑与指令相关的神经元,无需标注数据即可提升指令遵循性能。为全面评估该方法,我们构建了FreeInstruct基准测试,包含1,212个多样化真实案例,突出展现叙事丰富场景中指令遵循的挑战。虽然最初针对复杂叙事设计,但Concise-SAE在各类任务中均展现出最先进的指令遵循能力,且不影响生成质量。
AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning
Abstract
arXiv:2505.16400v1 Announce Type: cross Abstract: Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of frontier models, such as DeepSeek-R1, including data curation strategies and RL training recipes, are often omitted. Moreover, recent research indicates distillation remains more effective than RL for smaller models. In this work, we demonstrate that large-scale RL can significantly enhance the reasoning capabilities of strong, small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts. Notably, we find that math-only RL significantly enhances the performance of strong distilled models not only on math benchmarks (e.g., +14.6% / +17.2% on AIME 2025 for the 7B / 14B models), but also on code reasoning tasks (e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition, extended code-only RL iterations further improve performance on code benchmarks with minimal or no degradation in math results. We develop a robust data curation pipeline to collect challenging prompts with high-quality, verifiable answers and test cases to enable verification-based RL across both domains. Finally, we identify key experimental insights, including curriculum learning with progressively increasing response lengths and the stabilizing effect of on-policy parameter updates. We find that RL not only elicits the foundational reasoning capabilities acquired during pretraining and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.
摘要
尽管大规模强化学习(RL)在推理领域取得进展,但构建高性能推理模型的训练方案仍不明确。前沿模型(如DeepSeek-R1)的关键实现细节——包括数据筛选策略和RL训练方案——常被忽略。此外,近期研究表明对于较小模型,蒸馏法仍比RL更有效。本研究证明,大规模RL能显著增强中小型强模型的推理能力,其效果超越基于蒸馏的最先进模型。我们通过大量消融实验系统研究RL训练过程,提出一种简单有效的方法:先在纯数学提示上训练,再在纯代码提示上训练。值得注意的是,纯数学RL不仅显著提升强蒸馏模型在数学基准上的表现(例如7B/14B模型在AIME 2025上分别提升14.6%/17.2%),还能提升代码推理任务表现(例如7B/14B模型在LiveCodeBench上分别提升6.8%/5.8%)。此外,延长纯代码RL训练可进一步提升代码基准性能,同时数学结果仅有微小下降或保持稳定。我们开发了稳健的数据筛选流程,用于收集具有高质量可验证答案和测试用例的挑战性提示,以实现跨领域的基于验证的RL。最后,我们发现了关键实验洞见,包括响应长度渐进增加的课程学习策略和同策略参数更新的稳定效果。研究表明,RL不仅能激发预训练和监督微调(如蒸馏)中获得的基础推理能力,更能突破模型原有推理极限,使其解决此前无法解决的问题。
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
Abstract
arXiv:2505.16410v1 Announce Type: cross Abstract: Recently, large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). However, leveraging the RL algorithm to empower effective multi-tool collaborative reasoning in LLMs remains an open challenge. In this paper, we introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training. To address the scarcity of tool-use data, we propose a general tool-integrated reasoning data synthesis pipeline, which combines tool-integrated prompting with hint-based sampling to automatically and scalably generate tool-use trajectories. A subsequent quality normalization and difficulty-aware classification process filters out low-quality samples and organizes the dataset from easy to hard. Furthermore, we propose a two-stage training framework to enhance multi-tool collaborative reasoning by: (1) cold-start fine-tuning, which guides LLMs to explore reasoning patterns via tool-invocation feedback; and (2) a multi-tool self-critic RL algorithm with hierarchical reward design, which reinforces reward understanding and promotes effective tool collaboration. Experimental analyses on over 10 challenging reasoning benchmarks highlight the effectiveness and efficiency of Tool-Star. The code is available at https://github.com/dongguanting/Tool-Star.
摘要
近期,大规模语言模型(LLMs)通过大规模强化学习(RL)展现出卓越的推理能力。然而,如何利用RL算法实现LLMs中多工具协同推理的有效赋能仍是一个开放性问题。本文提出Tool-Star——一个基于RL的框架,旨在使LLMs能够在逐步推理过程中自主调用多个外部工具。该框架整合了六类工具,并在数据合成与训练中采用系统性设计。针对工具使用数据稀缺的问题,我们提出通用工具集成推理数据合成流程,通过工具集成提示与基于提示的采样相结合,实现自动化、可扩展的工具使用轨迹生成。后续的质量归一化与难度感知分类流程可过滤低质量样本,并将数据集按难度由易至难组织。此外,我们提出两阶段训练框架以增强多工具协同推理能力:(1)冷启动微调阶段,通过工具调用反馈引导LLMs探索推理模式;(2)采用分层奖励设计的"多工具自批判"RL算法,强化奖励理解并促进有效工具协作。在超过10个高难度推理基准上的实验分析验证了Tool-Star的有效性与高效性。代码已开源:https://github.com/dongguanting/Tool-Star。
Human-like Semantic Navigation for Autonomous Driving using Knowledge Representation and Large Language Models
Abstract
arXiv:2505.16498v1 Announce Type: cross Abstract: Achieving full automation in self-driving vehicles remains a challenge, especially in dynamic urban environments where navigation requires real-time adaptability. Existing systems struggle to handle navigation plans when faced with unpredictable changes in road layouts, spontaneous detours, or missing map data, due to their heavy reliance on predefined cartographic information. In this work, we explore the use of Large Language Models to generate Answer Set Programming rules by translating informal navigation instructions into structured, logic-based reasoning. ASP provides non-monotonic reasoning, allowing autonomous vehicles to adapt to evolving scenarios without relying on predefined maps. We present an experimental evaluation in which LLMs generate ASP constraints that encode real-world urban driving logic into a formal knowledge representation. By automating the translation of informal navigation instructions into logical rules, our method improves adaptability and explainability in autonomous navigation. Results show that LLM-driven ASP rule generation supports semantic-based decision-making, offering an explainable framework for dynamic navigation planning that aligns closely with how humans communicate navigational intent.
摘要
实现自动驾驶车辆的完全自动化仍面临挑战,尤其在动态城市环境中,导航需要实时适应能力。现有系统由于高度依赖预定义地图信息,在遇到道路布局不可预测变化、突发绕行或地图数据缺失时难以处理导航规划。本研究探索利用大型语言模型将非正式导航指令转化为基于逻辑的结构化推理,从而生成答案集编程规则。ASP提供的非单调推理能力使自动驾驶车辆无需依赖预设地图即可适应动态场景。我们通过实验评估表明,LLMs生成的ASP约束能将真实城市驾驶逻辑编码为形式化知识表示。通过自动化转换非正式导航指令为逻辑规则,本方法提升了自主导航的适应性与可解释性。结果表明,LLM驱动的ASP规则生成支持基于语义的决策制定,为动态导航规划提供了与人类导航意图表达高度契合的可解释框架。
LLaMAs Have Feelings Too: Unveiling Sentiment and Emotion Representations in LLaMA Models Through Probing
Abstract
arXiv:2505.16491v1 Announce Type: cross Abstract: Large Language Models (LLMs) have rapidly become central to NLP, demonstrating their ability to adapt to various tasks through prompting techniques, including sentiment analysis. However, we still have a limited understanding of how these models capture sentiment-related information. This study probes the hidden layers of Llama models to pinpoint where sentiment features are most represented and to assess how this affects sentiment analysis. Using probe classifiers, we analyze sentiment encoding across layers and scales, identifying the layers and pooling methods that best capture sentiment signals. Our results show that sentiment information is most concentrated in mid-layers for binary polarity tasks, with detection accuracy increasing up to 14% over prompting techniques. Additionally, we find that in decoder-only models, the last token is not consistently the most informative for sentiment encoding. Finally, this approach enables sentiment tasks to be performed with memory requirements reduced by an average of 57%. These insights contribute to a broader understanding of sentiment in LLMs, suggesting layer-specific probing as an effective approach for sentiment tasks beyond prompting, with potential to enhance model utility and reduce memory requirements.
摘要
大型语言模型(LLMs)已迅速成为自然语言处理的核心,通过提示技术(包括情感分析)展示了其适应各种任务的能力。然而,我们对这些模型如何捕捉情感相关信息仍知之甚少。本研究探究了Llama模型的隐藏层,以确定情感特征最集中的位置,并评估其对情感分析的影响。通过探针分类器,我们分析了不同层和规模下的情感编码,识别出最能捕捉情感信号的层和池化方法。结果表明,在二元极性任务中,情感信息最集中在中层,检测准确率较提示技术最高可提升14%。此外,我们发现仅解码器模型中,最后一个标记并非始终是情感编码信息量最大的部分。最终,该方法使情感任务的内存需求平均降低57%。这些发现深化了对LLMs中情感机制的理解,提出层特异性探针可作为超越提示技术的有效情感任务处理方案,并具备提升模型效用和降低内存需求的潜力。
Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning
Abstract
arXiv:2505.16483v1 Announce Type: cross Abstract: Teaching large language models (LLMs) to be faithful in the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to improve the faithfulness of LLMs in both short-form and long-form generation tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tasks to construct high-quality and easily verifiable training data without human annotation. Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data, while simultaneously optimizing both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data to train reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.
摘要
教导大型语言模型(LLM)在给定上下文中保持忠实性,对于构建可靠的信息检索系统至关重要。为此,我们提出了一个系统化框架CANOE,旨在无需人工标注的情况下提升LLM在短文本和长文本生成任务中的忠实性。具体而言,我们首先通过四项多样化任务合成短文本问答(QA)数据,从而构建高质量且易于验证的无标注训练数据。此外,我们提出了Dual-GRPO——一种基于规则的强化学习方法,该方法包含三种源自合成短文本QA数据的定制化规则奖励,同时优化短文本和长文本响应生成。值得注意的是,Dual-GRPO无需手动标注偏好数据来训练奖励模型,也避免了仅依赖合成短文本QA数据时对短文本生成的过度优化。实验结果表明,CANOE在11项不同下游任务中显著提升了LLM的忠实性,其表现甚至超越了最先进的LLM(如GPT-4o和OpenAI o1)。
Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
Abstract
arXiv:2505.16416v1 Announce Type: cross Abstract: Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to large vision-language models (LVLMs), its variants introduce unintended cross-modal positional biases. Specifically, they enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. This issue arises because image tokens representing the same content but located at different spatial positions are assigned distinct positional biases, leading to inconsistent cross-modal associations. To address this, we propose Per-Token Distance (PTD) - a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure. This configuration ensures that each text token maintains an equal distance to all image tokens, reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered layer strategy that applies different RoPE variants across layers. This design leverages the complementary strengths of each RoPE variant, thereby enhancing the model's overall performance. Our experimental results demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for LVLMs. The code is available at https://github.com/lose4578/CircleRoPE.
摘要
旋转位置编码(RoPE)是大语言模型(LLMs)中广泛采用的相对位置信息编码技术。然而当扩展至大视觉语言模型(LVLMs)时,其变体会引入非预期的跨模态位置偏差。具体表现为:这些变体会强制建立文本标记索引与图像标记之间的相对位置依赖关系,从而导致虚假对齐。该问题的根源在于,代表相同内容但位于不同空间位置的图像标记会被赋予不同的位置偏差,最终产生不一致的跨模态关联。为解决这一问题,我们提出"单标记距离"(PTD)——一种简单有效的量化跨模态位置编码独立性的指标。基于此分析,我们提出Circle-RoPE编码方案:将图像标记索引映射到与文本标记索引线性轨迹正交的圆形路径上,形成锥形结构。这种配置确保每个文本标记与所有图像标记保持等距,在保留图像内空间信息的同时减少人为跨模态偏差。为进一步提升性能,我们提出交错层策略——在不同网络层应用不同的RoPE变体。该设计能充分发挥各RoPE变体的互补优势,从而提升模型整体性能。实验结果表明,我们的方法在有效保留图像空间信息的同时降低了相对位置偏差,为LVLMs提供了更鲁棒、更灵活的位置编码框架。代码已开源于https://github.com/lose4578/CircleRoPE。
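The geometric claim in the Circle-RoPE abstract, that placing image token indices on a circle orthogonal to the text axis makes every text token equidistant from all image tokens, can be checked directly. The sketch below is a purely geometric illustration, not the rotary embedding itself: the 3D coordinates, the unit radius, and the even angular spacing are assumptions chosen to make the cone-like structure visible.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def text_position(index: int) -> Point:
    """Text tokens advance along the z-axis (the linear text path)."""
    return (0.0, 0.0, float(index))


def image_positions(n_tokens: int, radius: float = 1.0) -> List[Point]:
    """Image tokens sit on a circle in the xy-plane, orthogonal to the text axis."""
    return [
        (
            radius * math.cos(2 * math.pi * i / n_tokens),
            radius * math.sin(2 * math.pi * i / n_tokens),
            0.0,
        )
        for i in range(n_tokens)
    ]


def distances_to_image(text_index: int, n_tokens: int, radius: float = 1.0) -> List[float]:
    """Distance from one text token to every image token."""
    origin = text_position(text_index)
    return [math.dist(origin, p) for p in image_positions(n_tokens, radius)]
```

Each text token at height `t` is at distance `sqrt(radius**2 + t**2)` from every point on the circle, so no image token is favored by position, while distances between image tokens along the circle still differ and preserve intra-image structure.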
Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing
Abstract
arXiv:2505.16522v1 Announce Type: cross Abstract: Despite significant progress, recent studies have indicated that current large language models (LLMs) may still utilize bias during inference, leading to poor generalizability. Some benchmarks have been proposed to investigate the generalizability of LLMs, with each piece of data typically containing one type of controlled bias. However, a single piece of data may contain multiple types of biases in practical applications. To bridge this gap, we propose a multi-bias benchmark where each piece of data contains five types of biases. The evaluations conducted on this benchmark reveal that the performance of existing LLMs and debiasing methods is unsatisfactory, highlighting the challenge of eliminating multiple types of biases simultaneously. To overcome this challenge, we propose a causal effect estimation-guided multi-bias elimination method (CMBE). This method first estimates the causal effect of multiple types of biases simultaneously. Subsequently, we eliminate the causal effect of biases from the total causal effect exerted by both the semantic information and biases during inference. Experimental results show that CMBE can effectively eliminate multiple types of bias simultaneously to enhance the generalizability of LLMs.
摘要
尽管取得了显著进展,近期研究表明当前大规模语言模型(LLM)在推理过程中仍可能利用偏见,导致模型泛化能力较差。现有研究提出了一些基准来考察LLM的泛化能力,其中每条数据通常仅包含一种受控偏见类型。然而在实际应用中,单条数据可能同时存在多种偏见类型。为填补这一空白,我们提出了一个多偏见基准数据集,其中每条数据包含五种偏见类型。在该基准上的评估表明,现有LLM及去偏见方法的性能表现欠佳,这凸显了同时消除多种偏见类型的挑战性。为解决这一难题,我们提出了一种因果效应估计引导的多偏见消除方法(CMBE)。该方法首先同步估计多种偏见类型的因果效应,随后在推理过程中从语义信息和偏见共同产生的总因果效应中消除偏见的因果影响。实验结果表明,CMBE能有效同步消除多种偏见类型,从而提升LLM的泛化能力。
Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs
Abstract
arXiv:2505.16520v1 Announce Type: cross Abstract: Factual hallucinations are a major challenge for Large Language Models (LLMs). They undermine reliability and user trust by generating inaccurate or fabricated content. Recent studies suggest that when generating false statements, the internal states of LLMs encode information about truthfulness. However, these studies often rely on synthetic datasets that lack realism, which limits generalization when evaluating the factual accuracy of text generated by the model itself. In this paper, we challenge the findings of previous work by investigating truthfulness encoding capabilities, leading to the generation of a more realistic and challenging dataset. Specifically, we extend previous work by introducing: (1) a strategy for sampling plausible true-false factoid sentences from tabular data and (2) a procedure for generating realistic, LLM-dependent true-false datasets from Question Answering collections. Our analysis of two open-source LLMs reveals that while the findings from previous studies are partially validated, generalization to LLM-generated datasets remains challenging. This study lays the groundwork for future research on factuality in LLMs and offers practical guidelines for more effective evaluation.
摘要
事实性幻觉是大语言模型(LLMs)面临的主要挑战。其生成的错误或虚构内容会损害可靠性和用户信任。近期研究表明,当生成虚假陈述时,LLMs的内部状态会编码真实性信息。然而这些研究通常依赖于缺乏真实性的合成数据集,限制了在评估模型生成文本事实准确性时的泛化能力。本文通过研究真实性编码能力对前人研究结论提出质疑,并由此生成更具现实性和挑战性的数据集。具体而言,我们通过以下方式扩展了先前工作:(1)提出从表格数据中采样合理真伪事实句的策略;(2)设计从问答集合生成依赖于LLMs的真实真伪数据集的流程。对两个开源LLMs的分析表明,虽然前人研究结论得到部分验证,但向LLM生成数据集的泛化仍具挑战性。本研究为LLMs事实性领域的未来研究奠定基础,并为更有效的评估提供实用指南。
DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection
Abstract
arXiv:2505.16530v1 Announce Type: cross Abstract: Large language models (LLMs) are considered valuable Intellectual Properties (IP) for legitimate owners due to the enormous computational cost of training. It is crucial to protect the IP of LLMs from malicious stealing or unauthorized deployment. Despite existing efforts in watermarking and fingerprinting LLMs, these methods either impact the text generation process or require white-box access to the suspect model, making them impractical. Hence, we propose DuFFin, a novel Dual-Level Fingerprinting Framework for ownership verification in the black-box setting. DuFFin extracts the trigger pattern and the knowledge-level fingerprints to identify the source of a suspect model. We conduct experiments on a variety of models collected from the open-source website, including four popular base models as protected LLMs and their fine-tuning, quantization, and safety alignment versions, which are released by large companies, start-ups, and individual users. Results show that our method can accurately verify the copyright of the base protected LLM on their model variants, achieving an IP-ROC greater than 0.95. Our code is available at https://github.com/yuliangyan0807/llm-fingerprint.
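Summary
Because of the high computational cost of training, large language models (LLMs) are regarded as important intellectual property (IP) of their legitimate owners. Protecting the IP of LLMs against malicious theft or unauthorized deployment is crucial. Although existing work has made efforts in watermarking and fingerprinting LLMs, these methods either affect the text generation process or are limited to white-box access to the suspect model, which makes them impractical. We therefore propose DuFFin, a novel dual-level fingerprinting framework for ownership verification in the black-box setting. DuFFin identifies the source of a suspect model by extracting trigger-pattern and knowledge-level fingerprints. We run experiments on a variety of models collected from an open-source website, including four popular base models (as protected LLMs) and their fine-tuned, quantized, and safety-aligned versions released by large companies, start-ups, and individual users. The results show that the method accurately verifies the copyright of the base model on its variants, with an IP-ROC above 0.95. The code is open source: https://github.com/yuliangyan0807/llm-fingerprint.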
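The abstract reports an IP-ROC above 0.95 without defining the metric. As a rough illustration only, the sketch below assumes it behaves like an area under the ROC curve computed over fingerprint similarity scores: how often a model truly derived from the protected base scores higher than an unrelated model. The function name `roc_auc` and the score values are hypothetical and are not taken from the paper or its repository.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (a model derived
    from the protected base) receives a higher similarity score than a
    randomly chosen negative (an unrelated model). Ties count as 0.5.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up similarity scores, for illustration only.
derived_models = [0.91, 0.88, 0.79]    # variants of the protected model
unrelated_models = [0.35, 0.42, 0.81]  # independently trained models

auc = roc_auc(derived_models, unrelated_models)
print(f"AUC = {auc:.3f}")
```

A value near 1.0 would mean the fingerprint separates derived models from unrelated ones almost perfectly, while 0.5 would mean it does no better than chance.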
CUB: Benchmarking Context Utilisation Techniques for Language Models
Abstract
arXiv:2505.16518v1 Announce Type: cross Abstract: Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) that encourage or suppress context utilisation have recently been proposed to alleviate these issues, few have seen systematic comparison. In this paper, we develop CUB (Context Utilisation Benchmark) to help practitioners within retrieval-augmented generation (RAG) identify the best CMT for their needs. CUB allows for rigorous testing on three distinct context types, observed to capture key challenges in realistic context utilisation scenarios. With this benchmark, we evaluate seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to nine LMs. Our results show that most of the existing CMTs struggle to handle the full set of types of contexts that may be encountered in real-world retrieval-augmented scenarios. Moreover, we find that many CMTs display an inflated performance on simple synthesised datasets, compared to more realistic datasets with naturally occurring samples. Altogether, our results show the need for holistic tests of CMTs and the development of CMTs that can handle multiple context types.
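Summary
Incorporating external knowledge is crucial for knowledge-intensive tasks such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory, or be distracted by irrelevant context. Although many context utilisation manipulation techniques (CMTs) that encourage or suppress context utilisation have recently been proposed to alleviate these issues, few studies compare them systematically. This paper develops CUB (Context Utilisation Benchmark) to help practitioners in retrieval-augmented generation (RAG) choose the best CMT for their needs. CUB supports rigorous testing on three distinct context types, which have been observed to capture key challenges in realistic context utilisation scenarios. Using this benchmark, we evaluate seven state-of-the-art methods representing the main categories of CMTs, across three diverse datasets and tasks, applied to nine LMs. The results show that most existing CMTs struggle to handle the full range of context types that may arise in real-world retrieval-augmented scenarios. In addition, many CMTs show inflated performance on simple synthesised datasets and perform worse on more realistic datasets with naturally occurring samples. Overall, our results show the need for holistic testing of CMTs and for developing CMTs that can handle multiple context types.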
Steering Large Language Models for Machine Translation Personalization
Abstract
arXiv:2505.16612v1 Announce Type: cross Abstract: High-quality machine translation systems based on large language models (LLMs) have simplified the production of personalized translations reflecting specific stylistic constraints. However, these systems still struggle in settings where stylistic requirements are less explicit and might be harder to convey via prompting. We explore various strategies for personalizing LLM-generated translations in low-resource settings, focusing on the challenging literary translation domain. We explore prompting strategies and inference-time interventions for steering model generations towards a personalized style, and propose a contrastive framework exploiting latent concepts extracted from sparse autoencoders to identify salient personalization properties. Our results show that steering achieves strong personalization while preserving translation quality. We further examine the impact of steering on LLM representations, finding that model layers relevant to personalization are impacted similarly by multi-shot prompting and our steering method, suggesting a similar mechanism is at play.
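Summary
High-quality machine translation systems based on large language models (LLMs) have simplified the production of personalized translations that reflect specific stylistic constraints. However, these systems still struggle in settings where stylistic requirements are less explicit and hard to convey through prompting. We explore several strategies for personalizing LLM-generated translations in low-resource settings, focusing on the challenging domain of literary translation. We study prompting strategies and inference-time interventions that steer model generations towards a personalized style, and propose a contrastive framework that uses latent concepts extracted from sparse autoencoders to identify salient personalization properties. The results show that steering achieves strong personalization while preserving translation quality. We further examine the effect of steering on LLM representations and find that the model layers relevant to personalization are affected similarly by multi-shot prompting and by our steering method, which suggests that a similar mechanism is at play.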
Collaboration among Multiple Large Language Models for Medical Question Answering
Abstract
arXiv:2505.16648v1 Announce Type: cross Abstract: Empowered by a vast internal knowledge reservoir, the new generation of large language models (LLMs) demonstrates untapped potential to tackle medical tasks. However, insufficient effort has been made towards producing a synergistic effect from the expertise and background of multiple LLMs. In this study, we propose a multi-LLM collaboration framework tailored to a medical multiple-choice question dataset. Through post-hoc analysis of 3 pre-trained LLM participants, our framework is shown to boost the reasoning ability of all LLMs and to reduce their divergence across questions. We also measure an LLM's confidence when it is confronted with opposing opinions from other LLMs and observe a concurrence between the LLM's confidence and its prediction accuracy.
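Summary
With their vast internal knowledge reserves, the new generation of large language models (LLMs) shows untapped potential for medical tasks. However, how to combine the expertise and background of multiple LLMs to produce a synergistic effect has not yet been sufficiently explored. This study proposes a multi-LLM collaboration framework designed for a medical multiple-choice question dataset. A post-hoc analysis of 3 pre-trained LLM participants confirms that the framework improves the reasoning ability of all LLMs and reduces their disagreement across questions. We also measure the confidence an LLM shows when it faces opposing opinions from other LLMs, and observe agreement between the LLM's confidence and its prediction accuracy.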
Finetuning-Activated Backdoors in LLMs
Abstract
arXiv:2505.16567v1 Announce Type: cross Abstract: Finetuning openly accessible Large Language Models (LLMs) has become standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets led to predictable behaviors. In this paper, we demonstrate for the first time that an adversary can create poisoned LLMs that initially appear benign but exhibit malicious behaviors once finetuned by downstream users. To this end, our proposed attack, FAB (Finetuning-Activated Backdoor), poisons an LLM via meta-learning techniques to simulate downstream finetuning, explicitly optimizing for the emergence of malicious behaviors in the finetuned models. At the same time, the poisoned LLM is regularized to retain general capabilities and to exhibit no malicious behaviors prior to finetuning. As a result, when users finetune the seemingly benign model on their own datasets, they unknowingly trigger its hidden backdoor behavior. We demonstrate the effectiveness of FAB across multiple LLMs and three target behaviors: unsolicited advertising, refusal, and jailbreakability. Additionally, we show that FAB-backdoors are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler). Our findings challenge prevailing assumptions about the security of finetuning, revealing yet another critical attack vector exploiting the complexities of LLMs.
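Summary
Finetuning openly available large language models (LLMs) has become standard practice for achieving task-specific performance gains. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets leads to predictable behavior. This paper demonstrates for the first time that an adversary can create poisoned LLMs that initially appear benign but exhibit malicious behavior once finetuned by downstream users. To this end, our proposed attack, FAB (Finetuning-Activated Backdoor), poisons an LLM using meta-learning techniques that simulate downstream finetuning, explicitly optimizing for the emergence of malicious behavior in the finetuned model. At the same time, the poisoned LLM is regularized to retain its general capabilities and to show no malicious behavior before finetuning. As a result, when users finetune this seemingly benign model on their own datasets, they unknowingly trigger its hidden backdoor behavior. We validate the effectiveness of FAB across multiple LLMs and three target behaviors: unsolicited advertising, refusal, and jailbreakability. We also show that FAB backdoors are robust to the user's finetuning choices (e.g., dataset, number of steps, scheduler). These findings challenge prevailing assumptions about the security of finetuning and reveal yet another critical attack vector that exploits the complexity of LLMs.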
O²-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering
Abstract
arXiv:2505.16582v1 Announce Type: cross Abstract: Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open-domain up-to-date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed-ended problems. Open-ended questions, which are characterized by lacking a standard answer or by admitting non-unique and diverse answers, remain underexplored. To bridge this gap, we present O²-Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open-ended and closed-ended questions in the open domain. O²-Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling external world knowledge from the model's sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adopt different answer generation strategies. Furthermore, to evaluate performance on complex open-ended tasks, we construct O²-QA, a high-quality benchmark featuring 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O²-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O²-QA. It also achieves SOTA results on various closed-ended QA benchmarks against similarly-sized models, while performing on par with much larger ones.
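Summary
Despite significant progress, large language models (LLMs) are inherently limited by their static parametric knowledge, which hinders performance on tasks that require up-to-date open-domain information. Although letting LLMs interact with external knowledge environments is a promising solution, current research mainly targets closed-ended problems. Open-ended questions, which lack a standard answer or have non-unique and diverse answers, remain underexplored. To fill this gap, we propose O²-Searcher, a novel search agent based on reinforcement learning that effectively handles both open-ended and closed-ended questions in the open domain. The agent acquires knowledge dynamically through an efficient, locally simulated search environment, effectively decoupling external world knowledge from the model's complex reasoning process. We use a unified training mechanism with carefully designed reward functions, so that the agent can identify the problem type and adapt its answer generation strategy. In addition, to evaluate performance on complex open-ended tasks, we build the O²-QA benchmark, which contains 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O²-Searcher, using only 3 billion parameters, significantly outperforms leading LLM agents on O²-QA, achieves the best results among similarly sized models on a range of closed-ended QA benchmarks, and is on par with much larger models.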
SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation
Abstract
arXiv:2505.16637v1 Announce Type: cross Abstract: Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs heavily rely on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are often expensive to obtain and challenging to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for MT that is reference-free, fully online, and relies solely on self-judging rewards. Trained with SSR on 13K monolingual examples with Qwen-2.5-7B as the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs, e.g., TowerInstruct-13B and GemmaX-28-9B, as well as larger general LLMs like Qwen2.5-32B-Instruct in English ↔ Chinese translation tasks from the WMT23, WMT24, and Flores200 benchmarks. Furthermore, by augmenting SSR with external supervision from COMET, our strongest model, SSR-X-Zero-7B, achieves state-of-the-art performance in English ↔ Chinese translation, surpassing all existing open-source models under 72B parameters and even outperforming closed-source models, e.g., GPT-4o and Gemini 1.5 Pro. Our analysis highlights the effectiveness of the self-rewarding mechanism compared to the external LLM-as-a-judge approach in MT and demonstrates its complementary benefits when combined with trained RMs. Our findings provide valuable insight into the potential of self-improving RL methods. We have publicly released our code, data and models.
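Summary
Large language models (LLMs) have recently shown remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs rely heavily on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are usually costly and hard to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) reinforcement learning (RL) framework that needs no reference translations, runs fully online, and relies only on self-judging rewards. Built on Qwen-2.5-7B and trained with SSR on 13K monolingual examples, our SSR-Zero-7B model outperforms existing MT-specific LLMs (such as TowerInstruct-13B and GemmaX-28-9B) as well as larger general-purpose LLMs such as Qwen2.5-32B-Instruct on bidirectional English-Chinese translation tasks from the WMT23, WMT24, and Flores200 benchmarks. When SSR is further augmented with external supervision from COMET, our strongest model, SSR-X-Zero-7B, achieves state-of-the-art performance in bidirectional English-Chinese translation, surpassing all open-source models under 72B parameters and even outperforming closed-source models such as GPT-4o and Gemini 1.5 Pro. The analysis shows that the self-rewarding mechanism is more effective in MT than an external LLM-as-a-judge approach, and that it offers complementary benefits when combined with trained RMs. These findings provide valuable insight into the potential of self-improving RL methods. We have publicly released the code, data, and models.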
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
Abstract
arXiv:2505.16673v1 Announce Type: cross Abstract: In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, then encourages the MLLM to effectively explore diverse reasoning trajectories over the expanded question space, and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO also shares reward information during advantage computation, which estimates solution advantages hierarchically across and within question variants, allowing more accurate estimation of relative advantages and improving the stability of policy training. Extensive evaluations over six widely-used reasoning benchmarks showcase the superior performance of our method. Code will be available at https://github.com/HJYao00/R1-ShareVL.
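Summary
In this work, we aim to incentivize the reasoning ability of multimodal large language models (MLLMs) through reinforcement learning (RL), and to develop an effective approach that mitigates the sparse reward and advantage vanishing problems in RL. To this end, we propose Share-GRPO, a novel RL method that addresses these problems by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space of a given question using data transformation techniques, then encourages the MLLM to explore diverse reasoning trajectories over the expanded question space, and shares the discovered trajectories across the expanded questions during RL. In addition, Share-GRPO shares reward information during advantage computation, estimating solution advantages hierarchically across and within question variants, which gives more accurate estimates of relative advantage and improves the stability of policy training. Extensive evaluation on six widely used reasoning benchmarks demonstrates the superior performance of our method. Code will be released at https://github.com/HJYao00/R1-ShareVL.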
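To make the "advantage vanishing" issue concrete, the sketch below first computes the usual group-relative advantage used in GRPO-style training (each reward minus the group mean, divided by the group standard deviation), then shows one hypothetical way to share reward information across question variants. The function `two_level_advantages` and its equal-weight average of the two levels are assumptions made for illustration; the abstract does not give the formula that Share-GRPO actually uses.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def two_level_advantages(rewards_by_variant, eps=1e-6):
    """Hypothetical two-level estimate (not the paper's exact formula).

    Averages a within-variant advantage with an advantage computed against
    the rewards pooled over all variants of the same question.
    """
    pooled = [r for rs in rewards_by_variant.values() for r in rs]
    mu, sigma = mean(pooled), pstdev(pooled)
    result = {}
    for name, rs in rewards_by_variant.items():
        local = group_relative_advantages(rs, eps)       # within the variant
        shared = [(r - mu) / (sigma + eps) for r in rs]  # across variants
        result[name] = [0.5 * (a + b) for a, b in zip(local, shared)]
    return result


# Toy rewards: every sample for a given variant received the same reward.
rewards = {"original": [1.0, 1.0], "rephrased": [0.0, 0.0]}

print(group_relative_advantages(rewards["original"]))  # all zeros: no signal
print(two_level_advantages(rewards))  # pooled baseline restores a signal
```

When every sampled answer to a question gets the same reward, the within-group advantage is zero and contributes no gradient. Comparing against rewards pooled over the question's variants yields a non-zero signal in this toy case, which is the intuition behind sharing reward information across variants.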